Model review, May 2026: which AI to actually use this quarter
Five model families now matter for serious work: Anthropic Claude, OpenAI GPT, Google Gemini, Meta Llama, and the cost-leader pack (DeepSeek, Mistral, Qwen). Each has a thing it's genuinely best at and a thing it's genuinely worst at. Here's our short, opinionated map for the next quarter.
The 30-second decision table
| Workload | Default pick | Cheap-and-cheerful alternative |
|---|---|---|
| Customer chat (general) | Claude Haiku 4.5 | GPT-5 nano |
| Coding agents / refactors | Claude Sonnet 4.5 | DeepSeek V3.5 |
| Hard reasoning + maths | OpenAI o-series / GPT-5 | DeepSeek R1 |
| Long-document Q&A | Gemini 2.5 Pro | Claude Sonnet 4.5 |
| Image + screen understanding | GPT-5 / Gemini 2.5 Pro | Llama 4 Maverick |
| Bulk classification / extraction | Haiku 4.5 / GPT-5 nano | Llama 4 Scout (self-hosted) |
| EU data residency required | Mistral Large | Self-hosted Llama 4 |
Anthropic Claude (Haiku 4.5, Sonnet 4.5, Opus 4.5)
Claude has quietly become the default in a lot of agent and coding shops, including ours. It is calibrated - meaning when it doesn't know something, it tends to say so rather than confabulate - and the tool-use API is the cleanest in the market.
- Best at: agentic coding, structured output, refusal calibration (says “I'm not sure” when it isn't), long-context summarisation up to ~200K tokens.
- Weak at: live web search out of the box; vision is good but not class-leading.
- Pricing posture: Haiku 4.5 is the price-performance sweet spot - fast, cheap, surprisingly capable. Sonnet for serious reasoning, Opus for the hardest long-form work.
OpenAI (GPT-5, GPT-5 mini, GPT-5 nano, o-series)
Still the broadest ecosystem: it has the most plugins, and most third-party tools assume you are using it. GPT-5 is the strongest general-purpose model on multi-step reasoning when you let the o-series “think” mode burn extra tokens.
- Best at: hard reasoning with explicit chain-of-thought, multimodal (vision + audio in one model), ecosystem support.
- Weak at: price-performance at the mid-range - mini and nano are competitive but not the cheapest. Tends to follow instructions too literally, including ones you didn't intend.
- Pricing posture: pay-as-you-go is fair; the “reasoning” modes can rack up tokens fast if you don't cap them.
Google Gemini (2.5 Pro, 2.5 Flash)
Gemini is the long-context king. If you need to throw an entire legal disclosure pack or a year of customer support transcripts into one prompt, Pro's effective recall across 1-2M tokens is unmatched. Flash is fast and cheap; great for high-volume classification.
- Best at: giant context windows, video understanding, deep Google Workspace integrations.
- Weak at: agent tool-use is workable but rougher than Claude's; refusal patterns are occasionally frustrating on legitimate business queries.
Meta Llama (4 Scout, 4 Maverick)
The strongest open-weights family. You can run these on your own hardware or rent them from any cloud, which solves the data-egress and vendor-lock anxieties that block bigger contracts.
- Best at: self-hosting, fine-tuning on private data, predictable inference cost at volume.
- Weak at: the absolute frontier of reasoning still belongs to closed models; deployment ops are your problem.
The cost-leaders: DeepSeek, Mistral, Qwen
DeepSeek V3.5 and R1 keep punching above their weight on reasoning benchmarks at a fraction of frontier prices. Mistral Large is the practical choice for European data-residency contracts. Qwen is a credible Chinese-led alternative if your deployment is APAC-first.
Cost snapshot
Indicative published prices in USD per million input tokens, for the cheapest tier in each family that's still useful for chat or extraction. Output tokens cost more (typically 4-5x).
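The 4-5x output multiplier can be turned into a rough budget figure. Here is a minimal Python sketch, assuming you know your monthly token volumes; the function name, its parameters, and the price in the usage line are illustrative placeholders, not published rates.

```python
def monthly_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_multiplier: float = 4.5,
) -> float:
    """Estimate monthly spend from token volumes.

    output_multiplier is an assumption: how many times more an output
    token costs than an input token (the rule of thumb is 4-5x).
    """
    # Derive the output price from the input price and the multiplier.
    output_price_per_million = input_price_per_million * output_multiplier
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    # Prices are quoted per million tokens, so scale back down.
    return total / 1_000_000


# Placeholder price of $1.00 per million input tokens, 4x output multiplier.
estimate = monthly_cost_usd(50_000_000, 10_000_000, 1.0, output_multiplier=4.0)
print(f"${estimate:.2f}")
```

With those placeholder numbers, 50M input tokens and 10M output tokens come to $90 a month. Swap in the current price from your provider's pricing page before relying on the result.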
How we choose, end-to-end
The honest answer is: we test. Benchmarks are useful as a starting filter and almost useless as a final answer. Our decision tree:
| Question | If yes | If no |
|---|---|---|
| Is the data sensitive enough that it can't leave the EU/UK? | Mistral Large or self-hosted Llama 4 | Continue |
| Does the workload involve giant single documents (>200K tokens)? | Gemini 2.5 Pro | Continue |
| Is this an agent that needs tool-use and refusal calibration? | Claude Sonnet 4.5 (or Haiku for cheap) | Continue |
| Is it bulk extraction / classification at high volume? | Haiku 4.5, GPT-5 nano, or DeepSeek V3.5 | Continue |
| Hard multi-step reasoning where wrong answers are expensive? | GPT-5 with reasoning mode, or Opus 4.5 | Default to Haiku 4.5 |
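The table reads top to bottom, with the first "yes" winning. A minimal Python sketch of that ordering follows; the `Workload` fields and the `cost_sensitive` flag are hypothetical names introduced for illustration, and the returned strings simply repeat the table's picks.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All field names are illustrative assumptions, not a real API.
    must_stay_in_eu_uk: bool = False
    max_document_tokens: int = 0
    is_tool_using_agent: bool = False
    is_bulk_extraction: bool = False
    needs_hard_reasoning: bool = False
    cost_sensitive: bool = False  # maps to "(or Haiku for cheap)"


def pick_model(w: Workload) -> str:
    """Walk the decision table in order; the first matching row wins."""
    if w.must_stay_in_eu_uk:
        return "Mistral Large or self-hosted Llama 4"
    if w.max_document_tokens > 200_000:
        return "Gemini 2.5 Pro"
    if w.is_tool_using_agent:
        return "Claude Haiku 4.5" if w.cost_sensitive else "Claude Sonnet 4.5"
    if w.is_bulk_extraction:
        return "Haiku 4.5, GPT-5 nano, or DeepSeek V3.5"
    if w.needs_hard_reasoning:
        return "GPT-5 with reasoning mode, or Opus 4.5"
    # No row matched: fall through to the table's default.
    return "Haiku 4.5"


print(pick_model(Workload(is_tool_using_agent=True)))
```

Because the checks are ordered, a data-residency requirement overrides everything else, even for a long-document or agent workload. The output is a shortlist to test against your own data, not a final answer.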
Where we'd default for a small UK business in May 2026
- Customer-facing chat: Claude Haiku 4.5.
- Internal agent / workflow automation: Claude Sonnet 4.5 with tool-use.
- Bulk doc processing: Gemini 2.5 Flash for cost, Pro for the gnarly ones.
- Anything sensitive: Mistral Large via EU-hosted endpoint, or Llama 4 self-hosted on a small GPU server.
Re-evaluate every quarter. The cheapest model that meets your bar today probably isn't the cheapest one tomorrow.