Glasswork Analytics
All insights
May 2026Model Reviews11 min read

Model review, May 2026: which AI to actually use this quarter

Five model families now matter for serious work: Anthropic Claude, OpenAI GPT, Google Gemini, Meta Llama, and the cost-leader pack (DeepSeek, Mistral, Qwen). Each has a thing it's genuinely best at and a thing it's genuinely worst at. Here's our short, opinionated map for the next quarter.

The 30-second decision table

WorkloadDefault pickCheap-and-cheerful alternative
Customer chat (general)Claude Haiku 4.5GPT-5 nano
Coding agents / refactorsClaude Sonnet 4.5DeepSeek V3.5
Hard reasoning + mathsOpenAI o-series / GPT-5DeepSeek R1
Long-document Q&AGemini 2.5 ProClaude Sonnet 4.5
Image + screen understandingGPT-5 / Gemini 2.5 ProLlama 4 Maverick
Bulk classification / extractionHaiku 4.5 / GPT-5 nanoLlama 4 Scout (self-hosted)
EU data residency requiredMistral LargeSelf-hosted Llama 4

Anthropic Claude (Haiku 4.5, Sonnet 4.5, Opus 4.5)

Claude has quietly become the default in a lot of agent and coding shops, including ours. It is calibrated - meaning when it doesn't know something, it tends to say so rather than confabulate - and the tool-use API is the cleanest in the market.

  • Best at:agentic coding, structured output, refusal calibration (says “I'm not sure” when it isn't), long-context summarisation up to ~200K tokens.
  • Weak at: live web search out of the box; vision is good but not class-leading.
  • Pricing posture: Haiku 4.5 is the price-performance sweet spot - fast, cheap, surprisingly capable. Sonnet for serious reasoning, Opus for the hardest long-form work.

OpenAI (GPT-5, GPT-5 mini, GPT-5 nano, o-series)

Still the broadest ecosystem - the most plugins, the most third-party tools assume it works. GPT-5 is the strongest general-purpose model on multi-step reasoning when you let the o-series “think” mode burn extra tokens.

  • Best at: hard reasoning with explicit chain-of-thought, multimodal (vision + audio in one model), ecosystem support.
  • Weak at: price-performance at the mid-range - mini and nano are competitive but not the cheapest. Tends to over-comply with instructions you didn't mean.
  • Pricing posture:Pay-as-you-go is fair; the “reasoning” modes can rack up tokens fast if you don't cap them.

Google Gemini (2.5 Pro, 2.5 Flash)

Gemini is the long-context king. If you need to throw an entire legal disclosure pack or a year of customer support transcripts into one prompt, Pro's effective recall across 1-2M tokens is unmatched. Flash is fast and cheap; great for high-volume classification.

  • Best at: giant context windows, video understanding, deep Google Workspace integrations.
  • Weak at: agent tool-use is workable but rougher than Claude; refusal patterns occasionally frustrating on legitimate business queries.

Meta Llama (4 Scout, 4 Maverick)

The strongest open-weights family. You can run these on your own hardware or rent them from any cloud, which solves the data-egress and vendor-lock anxieties that block bigger contracts.

  • Best at: self-hosting, fine-tuning on private data, predictable inference cost at volume.
  • Weak at: the absolute frontier of reasoning still belongs to closed models; deployment ops are your problem.

The cost-leaders: DeepSeek, Mistral, Qwen

DeepSeek V3.5 and R1 keep punching above their weight on reasoning benchmarks at a fraction of frontier prices. Mistral Large is the practical choice for European data-residency contracts. Qwen is a credible Chinese-led alternative if your deployment is APAC-first.

Cost snapshot

Indicative published prices in USD per million input tokens, for the cheapest tier in each family that's still useful for chat or extraction. Output tokens cost more (typically 4-5x).

Llama 4 Scout (self-hosted)$0.06
DeepSeek V3.5$0.14
GPT-5 nano$0.15
Claude Haiku 4.5$0.25
Gemini 2.5 Flash$0.30
Mistral Large$2.00
Claude Sonnet 4.5$3.00
GPT-5$3.50
Gemini 2.5 Pro$4.00
Claude Opus 4.5$15.00
Approximate input pricing, May 2026. Always check the live pricing page before committing - vendors discount frequently.

How we choose, end-to-end

The honest answer is: we test. Benchmarks are useful as a starting filter and almost useless as a final answer. Our decision tree:

QuestionIf yesIf no
Is the data sensitive enough that it can't leave the EU/UK?Mistral Large or self-hosted Llama 4Continue
Does the workload involve giant single documents (>200K tokens)?Gemini 2.5 ProContinue
Is this an agent that needs tool-use and refusal calibration?Claude Sonnet 4.5 (or Haiku for cheap)Continue
Is it bulk extraction / classification at high volume?Haiku 4.5, GPT-5 nano, or DeepSeek V3.5Continue
Hard multi-step reasoning where wrong answers are expensive?GPT-5 with reasoning mode, or Opus 4.5Default to Haiku 4.5
5
model families
worth your time
10x
price spread
for similar quality on easy tasks
Weekly
rankings shift
re-test before renewal

Where we'd default for a small UK business in May 2026

  • Customer-facing chat: Claude Haiku 4.5.
  • Internal agent / workflow automation: Claude Sonnet 4.5 with tool-use.
  • Bulk doc processing: Gemini 2.5 Flash for cost, Pro for the gnarly ones.
  • Anything sensitive: Mistral Large via EU-hosted endpoint, or Llama 4 self-hosted on a small GPU server.

Re-evaluate every quarter. The cheapest model that meets your bar today probably isn't the cheapest one tomorrow.

Insights · One email a month

Useful things, when there are useful things to say.

Plain-English notes on AI, automation, and bespoke software for UK SMEs. We don’t do drip campaigns. Unsubscribe in one click.

We only ask for your email if you’ve opted in to marketing cookies. That’s how we keep things tidy - one place to change your mind, any time.