May 2026 | Model Reviews | 11 min read

Model review, May 2026: which AI to actually use this quarter

Five model families now matter for serious work: Anthropic Claude, OpenAI GPT, Google Gemini, Meta Llama, and the cost-leader pack (DeepSeek, Mistral, Qwen). Each has a thing it's genuinely best at and a thing it's genuinely worst at. Here's our short, opinionated map for the next quarter.

The 30-second decision table

Workload	Default pick	Cheap-and-cheerful alternative
Customer chat (general)	Claude Haiku 4.5	GPT-5 nano
Coding agents / refactors	Claude Sonnet 4.5	DeepSeek V3.5
Hard reasoning + maths	OpenAI o-series / GPT-5	DeepSeek R1
Long-document Q&A	Gemini 2.5 Pro	Claude Sonnet 4.5
Image + screen understanding	GPT-5 / Gemini 2.5 Pro	Llama 4 Maverick
Bulk classification / extraction	Haiku 4.5 / GPT-5 nano	Llama 4 Scout (self-hosted)
EU data residency required	Mistral Large	Self-hosted Llama 4

Anthropic Claude (Haiku 4.5, Sonnet 4.5, Opus 4.5)

Claude has quietly become the default in a lot of agent and coding shops, including ours. It is calibrated - meaning when it doesn't know something, it tends to say so rather than confabulate - and the tool-use API is the cleanest in the market.

Best at: agentic coding, structured output, refusal calibration (says “I'm not sure” when it isn't), long-context summarisation up to ~200K tokens.
Weak at: live web search out of the box; vision is good but not class-leading.
Pricing posture: Haiku 4.5 is the price-performance sweet spot - fast, cheap, surprisingly capable. Sonnet for serious reasoning, Opus for the hardest long-form work.

OpenAI (GPT-5, GPT-5 mini, GPT-5 nano, o-series)

Still the broadest ecosystem: it has the most plugins, and most third-party tools assume you're using it. GPT-5 is the strongest general-purpose model on multi-step reasoning when you let the o-series-style “think” mode burn extra tokens.

Best at: hard reasoning with explicit chain-of-thought, multimodal (vision + audio in one model), ecosystem support.
Weak at: price-performance at the mid-range - mini and nano are competitive but not the cheapest. Tends to over-comply with instructions you didn't mean.
Pricing posture: pay-as-you-go is fair; the “reasoning” modes can rack up tokens fast if you don't cap them.

Google Gemini (2.5 Pro, 2.5 Flash)

Gemini is the long-context king. If you need to throw an entire legal disclosure pack or a year of customer support transcripts into one prompt, Pro's effective recall across 1-2M tokens is unmatched. Flash is fast and cheap; great for high-volume classification.

Best at: giant context windows, video understanding, deep Google Workspace integrations.
Weak at: agent tool-use is workable but rougher than Claude's; refusal patterns are occasionally frustrating on legitimate business queries.

Meta Llama (4 Scout, 4 Maverick)

The strongest open-weights family. You can run these on your own hardware or rent them from any cloud, which solves the data-egress and vendor lock-in anxieties that block bigger contracts.

Best at: self-hosting, fine-tuning on private data, predictable inference cost at volume.
Weak at: the absolute frontier of reasoning still belongs to closed models; deployment ops are your problem.

The cost-leaders: DeepSeek, Mistral, Qwen

DeepSeek V3.5 and R1 keep punching above their weight on reasoning benchmarks at a fraction of frontier prices. Mistral Large is the practical choice for European data-residency contracts. Qwen is a credible Chinese-led alternative if your deployment is APAC-first.

Cost snapshot

Indicative published prices in USD per million input tokens, for the cheapest tier in each family that's still useful for chat or extraction. Output tokens cost more (typically 4-5x).

Model	Input price (USD per 1M tokens)
Llama 4 Scout (self-hosted)	$0.06
DeepSeek V3.5	$0.14
GPT-5 nano	$0.15
Claude Haiku 4.5	$0.25
Gemini 2.5 Flash	$0.30
Mistral Large	$2.00
Claude Sonnet 4.5	$3.00
GPT-5	$3.50
Gemini 2.5 Pro	$4.00
Claude Opus 4.5	$15.00

Approximate input pricing, May 2026. Always check the live pricing page before committing - vendors discount frequently.
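To turn the snapshot into a rough budget figure, a back-of-envelope calculation is usually enough. The sketch below is illustrative only: the prices are the indicative input prices listed above, the function name `estimate_cost_usd` is our own, and the default `output_multiplier` of 4.5 is an assumption (the midpoint of the "typically 4-5x" range), not a published rate for any vendor.

```python
# Indicative input prices in USD per million tokens, copied from the snapshot above.
INPUT_PRICE_PER_MTOK = {
    "Llama 4 Scout (self-hosted)": 0.06,
    "DeepSeek V3.5": 0.14,
    "GPT-5 nano": 0.15,
    "Claude Haiku 4.5": 0.25,
    "Gemini 2.5 Flash": 0.30,
    "Mistral Large": 2.00,
    "Claude Sonnet 4.5": 3.00,
    "GPT-5": 3.50,
    "Gemini 2.5 Pro": 4.00,
    "Claude Opus 4.5": 15.00,
}


def estimate_cost_usd(model, input_tokens, output_tokens, output_multiplier=4.5):
    """Rough spend estimate in USD.

    output_multiplier is an ASSUMPTION: the article only says output tokens
    typically cost 4-5x the input price, so real output prices will differ.
    """
    input_price = INPUT_PRICE_PER_MTOK[model]
    output_price = input_price * output_multiplier
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Example: 10M input tokens and 2M output tokens in a month on Haiku 4.5.
    monthly = estimate_cost_usd("Claude Haiku 4.5", 10_000_000, 2_000_000)
    print(f"${monthly:.2f}")  # prints $4.75
```

Swap in the live prices, including the real output price, before using a figure like this in a budget.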

How we choose, end-to-end

The honest answer is: we test. Benchmarks are useful as a starting filter and almost useless as a final answer. Our decision tree:

Question	If yes	If no
Is the data sensitive enough that it can't leave the EU/UK?	Mistral Large or self-hosted Llama 4	Continue
Does the workload involve giant single documents (>200K tokens)?	Gemini 2.5 Pro	Continue
Is this an agent that needs tool-use and refusal calibration?	Claude Sonnet 4.5 (or Haiku 4.5 on a budget)	Continue
Is it bulk extraction / classification at high volume?	Haiku 4.5, GPT-5 nano, or DeepSeek V3.5	Continue
Hard multi-step reasoning where wrong answers are expensive?	GPT-5 with reasoning mode, or Opus 4.5	Default to Haiku 4.5
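The table above reads top to bottom, with the first "yes" winning. As a minimal sketch, that ordering can be written as a chain of early returns. The `Workload` fields and the `pick_model` name are hypothetical, chosen for illustration; the recommendations themselves are copied from the table.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All field names are illustrative assumptions, one per question in the table.
    must_stay_in_eu_uk: bool = False
    max_document_tokens: int = 0
    is_tool_using_agent: bool = False
    cost_sensitive: bool = False
    is_bulk_extraction: bool = False
    needs_hard_reasoning: bool = False


def pick_model(w: Workload) -> str:
    """Walk the decision table in order; the first matching question decides."""
    if w.must_stay_in_eu_uk:
        return "Mistral Large or self-hosted Llama 4"
    if w.max_document_tokens > 200_000:
        return "Gemini 2.5 Pro"
    if w.is_tool_using_agent:
        return "Claude Haiku 4.5" if w.cost_sensitive else "Claude Sonnet 4.5"
    if w.is_bulk_extraction:
        return "Haiku 4.5, GPT-5 nano, or DeepSeek V3.5"
    if w.needs_hard_reasoning:
        return "GPT-5 with reasoning mode, or Opus 4.5"
    return "Claude Haiku 4.5"  # the table's default when every answer is "no"


if __name__ == "__main__":
    print(pick_model(Workload(max_document_tokens=500_000)))  # prints Gemini 2.5 Pro
```

Order matters here: a sensitive workload with giant documents still lands on the data-residency answer, because that question is asked first.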

Five model families worth your time. 10x price spread for similar quality on easy tasks. Rankings shift weekly: re-test before renewal.

Where we'd default for a small UK business in May 2026

Customer-facing chat: Claude Haiku 4.5.
Internal agent / workflow automation: Claude Sonnet 4.5 with tool-use.
Bulk doc processing: Gemini 2.5 Flash for cost, Pro for the gnarly ones.
Anything sensitive: Mistral Large via EU-hosted endpoint, or Llama 4 self-hosted on a small GPU server.

Re-evaluate every quarter. The cheapest model that meets your bar today probably isn't the cheapest one tomorrow.
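One way to make that quarterly re-test mechanical is to score each candidate on your own evaluation set, then take the cheapest one that clears your bar. The sketch below assumes you already have those scores; the function name `cheapest_passing`, the candidate names, the prices, and the scores are all made-up placeholders, not measurements of any real model.

```python
def cheapest_passing(candidates, quality_bar):
    """Return the name of the cheapest candidate whose score meets the bar.

    candidates: iterable of (name, price_per_mtok_usd, eval_score) tuples,
    where eval_score comes from your own test set (higher is better).
    Returns None if nothing clears the bar.
    """
    passing = [c for c in candidates if c[2] >= quality_bar]
    if not passing:
        return None
    return min(passing, key=lambda c: c[1])[0]


if __name__ == "__main__":
    # Placeholder data for illustration only; substitute your own results.
    results = [
        ("candidate-a", 0.10, 0.71),
        ("candidate-b", 0.30, 0.88),
        ("candidate-c", 2.00, 0.93),
    ]
    print(cheapest_passing(results, quality_bar=0.85))  # prints candidate-b
```

Re-running this each quarter with fresh prices and fresh scores will show whether the cheapest passing model has changed.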