Capital & Compute
Leaderboard · Updated June 27, 2026

The AI model value leaderboard

Most rankings tell you which model scores highest. This one also tells you which model is worth the money. Each LLM is rated two ways: its independent benchmark score, and its value, the points it buys you per dollar of tokens. The two orders are not the same.

Which AI model is the best value right now?

On benchmark scores alone, the strongest LLM you can buy here is Claude Opus 4.8 (74.3 of 100 on the coding composite). But cost flips the ranking: MiniMax M3 delivers about 111.6 coding points per dollar of blended token price, roughly 15.1x the value of Claude Opus 4.8. The cheaper models, mostly from Chinese labs, win on value; the priciest US flagships win on raw capability. Which one is "best" depends entirely on whether you are buying ability or buying ability per dollar.

  • Best value, coding: MiniMax M3, 111.6 coding points per dollar (blended)
  • Top coding score (buyable): Claude Opus 4.8, 74.3 of 100 on the coding composite
  • Cheapest to run: MiniMax M3, $0.50 blended (3:1) price per Mtok
  • Models tracked: 21 (12 scored on coding, 21 on intelligence)

The value ladder, in one chart

Coding points per dollar of blended token price, for the models you can buy today. The ranking is almost the inverse of the raw-score ranking: the cheapest capable models sit at the top because they score within range of the frontier at a fraction of the price.

Chart: AI model coding value, composite coding score per dollar of blended token price. A lollipop chart ranking buyable models by coding value, defined as coding composite score divided by blended token price. The cheaper open-weight models rank highest; the most expensive flagship ranks lowest.
AI model coding value: composite coding score per dollar of blended token price

Model              Value
MiniMax M3         111.6
Qwen3.7 Plus       99.8
Nemotron 3 Ultra   36.5
Kimi K2.7 Code     35.5
Devstral 2         34.8
GLM-5.2            32.0
Llama 4 Maverick   31.8
Grok 4.3           22.5
Qwen3.7 Max        17.6
Gemini 3.1 Pro     15.3
Claude Opus 4.8    7.4

Coding value (composite score divided by blended token price) for buyable models. Higher is more coding ability per dollar. Blended price weights input to output 3 to 1. Source: Price Per Token (composite scores) and Capital & Compute verified pricing.

The same picture, on general intelligence

Not every model has a coding score, but every buyable model has an intelligence composite, so this ladder covers all of them, including DeepSeek, the GPT-5 tiers, Grok, Sonnet and Haiku. The story holds: the cheap models lead on value, and the priciest flagships (GPT-5.2 Pro, Claude Opus 4.8, Grok 4) fall to the bottom, where a top score cannot outrun a high token price.

Chart: AI model intelligence value, composite intelligence score per dollar of blended token price. A lollipop chart ranking every buyable model by intelligence value, defined as intelligence composite score divided by blended token price. Cheap models such as MiniMax M3, Qwen3.7 Plus and DeepSeek V4 rank highest; expensive flagships such as GPT-5.2 Pro rank lowest.
AI model intelligence value: composite intelligence score per dollar of blended token price

Model                  Value
MiniMax M3             84.6
Qwen3.7 Plus           69.6
DeepSeek V4            57.4
Nemotron 3 Ultra       28.0
Llama 4 Maverick       27.9
Kimi K2.7 Code         24.5
GLM-5.2                23.8
Devstral 2             21.3
Grok 4.3               15.9
Qwen3.7 Max            12.3
Claude Haiku 4.5       11.9
Gemini 3.1 Pro         10.3
Gemini 3.5 Flash       10.3
GPT-5.3 Codex          9.2
Gemini 3 Pro Preview   7.4
Claude Sonnet 4.6      5.7
Claude Opus 4.8        5.6
Grok 4                 5.6
GPT-5.2                5.4
GPT-5.2 Pro            0.7

Intelligence value (composite score divided by blended token price) for every buyable model. Higher is more measured reasoning per dollar. The expensive flagships rank low here despite strong raw scores. Source: Price Per Token (composite scores) and Capital & Compute / source-dataset pricing.

Rank it yourself

Switch the benchmark between coding and general intelligence, and switch the sort between value and raw score. The default view is coding, ranked by value.

Best LLM by coding, ranked by value

Composite of code generation, understanding, and problem-solving (0-100). Value is coding points per dollar of blended token price.

Live ranking

Benchmark. Coding: code-writing and problem-solving. Intelligence: general reasoning. Both are 0-100 composite scores.

Sort by. Value: points per dollar (best bang for the buck). Raw score: highest benchmark, price aside.

What is value? It is how many coding points you get per dollar of tokens: coding score ÷ blended price, where blended price = (3 × input + output) ÷ 4, in dollars per million tokens. Input is weighted higher because coding agents read far more than they write. A higher value means more measured ability per dollar; it is a cost-efficiency measure, not a verdict on which model is best.

#   Model              Provider                 Coding  Input $/Mtok  Output $/Mtok  Value (pts/$)
1   MiniMax M3         MiniMax                  58.6    $0.30         $1.20          111.6 (best value)
2   Qwen3.7 Plus       Alibaba                  55.9    $0.32         $1.28          99.8
3   Nemotron 3 Ultra   Nvidia                   49.3    $0.60         $3.60          36.5
4   Kimi K2.7 Code     Moonshot                 60.8    $0.95         $4.00          35.5
5   Devstral 2         Mistral                  31.3    $0.90         $0.90          34.8
6   GLM-5.2            Zhipu                    68.8    $1.40         $4.40          32.0
7   Llama 4 Maverick   Meta                     16.3    $0.35         $1.00          31.8
8   Grok 4.3           xAI                      35.2    $1.25         $2.50          22.5
9   Qwen3.7 Max        Alibaba                  66.0    $2.50         $7.50          17.6
10  Gemini 3.1 Pro     Google                   68.8    $2.00         $12.00         15.3
11  Claude Opus 4.8    Anthropic                74.3    $5.00         $25.00         7.4
12  Claude Fable 5     Anthropic (unavailable)  76.5    $10.00        $50.00         3.8

Switch the benchmark or sort to re-rank. Scores are 0-100 composites from the source dataset. A ✓ next to the provider marks a price reconciled with our verified registry against the provider source; the rest are the author-direct or endpoint rate reported by the source dataset, not yet independently re-verified.
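
The two sort orders can be reproduced offline. The following is a minimal sketch, not the site's own code: the `Row` type, the `value` helper and the four-row sample are illustrative assumptions, with scores and prices copied from the coding table on this page.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    score: float       # coding composite, 0-100
    input_usd: float   # dollars per million input tokens
    output_usd: float  # dollars per million output tokens


def value(row: Row) -> float:
    """Coding points per dollar of blended (3:1 input:output) token price."""
    blended = (3 * row.input_usd + row.output_usd) / 4
    return row.score / blended


# A four-row sample taken from the coding table.
ROWS = [
    Row("MiniMax M3", 58.6, 0.30, 1.20),
    Row("GLM-5.2", 68.8, 1.40, 4.40),
    Row("Qwen3.7 Max", 66.0, 2.50, 7.50),
    Row("Claude Opus 4.8", 74.3, 5.00, 25.00),
]

# "Sort by: Value" and "Sort by: Raw score", both descending.
by_value = sorted(ROWS, key=value, reverse=True)
by_score = sorted(ROWS, key=lambda r: r.score, reverse=True)

print("By value:", [r.model for r in by_value])
print("By score:", [r.model for r in by_score])
```

On this sample the model that leads by value sits last by raw score, and the reverse, which is the inversion the value ladder describes.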

How to read this leaderboard

Two grounded inputs, one derived number. The benchmark scores are composite Coding and Intelligence scores read from the Price Per Token dataset, which aggregates independent benchmarks and cites Artificial Analysis, the HuggingFace Open LLM Leaderboard, and LayerLens. They are composites on a 0-100 scale, not a single named test. The prices carry a check when they are reconciled with this site's own verified registry, the same numbers behind the model release tracker; the rest are the author-direct or endpoint rate reported by the source dataset, labeled per row and not yet independently re-verified. Each links to its provider source.

From those two, the leaderboard computes value: the benchmark score divided by a blended token price, where blended price weights input and output tokens 3 to 1, ((3 × input) + output) ÷ 4. The 3:1 mix mirrors how an agentic coding session actually bills: it reads far more context than it writes. Value is a cost-efficiency measure, not a verdict on quality. A model can top the value ranking and still be the wrong choice for work that needs the highest absolute score. It is the same lesson as the price reversal in per-task cost: the headline number and the number that matters are rarely the same.
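
As a worked check of that arithmetic, here is a small sketch under stated assumptions: the function names `blended_price` and `value_per_dollar` are ours, not part of any published tool, and the inputs are the MiniMax M3 and Claude Opus 4.8 rows from the coding table.

```python
def blended_price(input_per_mtok: float, output_per_mtok: float) -> float:
    """Blended dollars per million tokens, weighting input to output 3 to 1."""
    return (3 * input_per_mtok + output_per_mtok) / 4


def value_per_dollar(score: float, input_per_mtok: float, output_per_mtok: float) -> float:
    """Benchmark points bought per dollar of blended token price."""
    return score / blended_price(input_per_mtok, output_per_mtok)


# MiniMax M3: coding 58.6 at $0.30 input / $1.20 output per Mtok.
minimax = value_per_dollar(58.6, 0.30, 1.20)

# Claude Opus 4.8: coding 74.3 at $5 input / $25 output per Mtok.
opus = value_per_dollar(74.3, 5.00, 25.00)

print(f"MiniMax M3:      {minimax:.1f} pts/$")
print(f"Claude Opus 4.8: {opus:.1f} pts/$")
```

The blended prices work out to $0.525 and $10.00 per Mtok, so the two values round to 111.6 and 7.4, matching the table.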

The independence rule

This leaderboard sells no placement. Ranking is never for sale, and no model is promoted for payment. The entire point of a value table is to be a neutral referee of what each model actually costs to use; the day a vendor could pay to look cheaper, it would be worthless. The benchmark scores come from an independent third party; the prices come from primary provider sources.

What is not here, and why

A model is listed only once an independent composite score exists for it, so the table is not padded with vendor-reported numbers. That leaves a few notable models tracked but not yet ranked:

  • GPT-5.5 / GPT-5.6 Sol, Terra, Luna. OpenAI's newest flagships. GPT-5.6 was previewed June 26, 2026 in a limited, US-government-gated rollout, and GPT-5.5 has no composite in the source dataset yet, so neither is scored. GPT-5.2 Pro and GPT-5.3 Codex represent OpenAI on the board for now.
  • Cohere North Mini Code. Free on hosted endpoints and open-weight, so a per-token value score is undefined. It posts a 33.4 Artificial Analysis Coding Index; the real cost is self-hosted compute, not a token rate.
  • Smaller and older variants. Models below roughly 10B parameters, superseded 2024-era releases (Claude 3.5, GPT-4 Turbo, o1), and narrowly tracked or unpriced entries are left off to keep the board to current, recognizable, buyable models.
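
Why a free model cannot carry a value score follows from the division itself: a blended price of zero leaves points per dollar undefined. A minimal sketch of one way to handle that case, assuming a hypothetical helper `value_or_none` that returns `None` instead of dividing by zero (the 33.4 figure is the Artificial Analysis Coding Index quoted above, used here only as a placeholder score):

```python
from typing import Optional


def value_or_none(score: float, input_usd: float, output_usd: float) -> Optional[float]:
    """Points per dollar of blended price, or None when the model has no token price."""
    blended = (3 * input_usd + output_usd) / 4
    if blended <= 0:
        # Free endpoint: there is no per-token cost to divide by.
        return None
    return score / blended


# A free hosted model has no defined value score.
print(value_or_none(33.4, 0.0, 0.0))

# A priced model does: Devstral 2 at $0.90 / $0.90 per Mtok.
print(value_or_none(31.3, 0.90, 0.90))
```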

For per-token rates and release dates across every model the site follows, see the AI model release tracker. To turn these rates into the cost of a real job, use the cost-per-task calculator or put two models head to head. To pay nothing at all, see which AI models are free to use and good enough to ship with.

Frequently asked questions

What is the best LLM for coding in 2026?
Among models you can actually buy, Claude Opus 4.8 posts the highest coding composite in this dataset, 74.3 of 100. (Claude Fable 5 scores higher at 76.5 but is suspended worldwide under a US export-control directive, so it is listed for reference only.) On value (points bought per dollar), cheaper models such as MiniMax M3 lead instead, because they score within range of the frontier at a fraction of the token price.
What is the best value AI model?
On the coding composite, MiniMax M3 is the value leader at about 111.6 points per dollar of blended price, roughly 15.1 times Claude Opus 4.8. Other low-cost models cluster near the top too: Qwen3.7 Plus and Kimi K2.7 Code on coding, and DeepSeek V4 on the intelligence ladder. Value rewards low price, so it favors capable cheap models over the most expensive flagships: GPT-5.2 Pro, at $21/$168 per Mtok, lands last on intelligence value despite a top-tier score.
How is the value score calculated?
Value equals the benchmark composite score divided by a blended token price. The blended price weights input and output tokens 3 to 1: (3 times input + output) divided by 4, in dollars per million tokens. The 3:1 mix reflects agentic coding, which reads far more context than it writes. A higher value means more measured ability per dollar; it is a cost-efficiency measure, not a quality ranking on its own.
Where do the benchmark scores come from?
The composite Coding and Intelligence scores are read from the Price Per Token dataset, which aggregates independent benchmarks and cites Artificial Analysis, the HuggingFace Open LLM Leaderboard, and LayerLens. They are composite scores on a 0-100 scale, not a single named test. Prices marked with a check are reconciled with this site's own verified registry against the provider source; the rest are the author-direct or endpoint rate reported by the source dataset, labeled per row and not yet independently re-verified. This leaderboard sells no placement: ranking is never for sale.
Why are GPT-5.5 and GPT-5.6 not on the leaderboard?
They are tracked on the model registry but not yet scored. GPT-5.6 (Sol, Terra, Luna) was previewed on June 26, 2026 in a limited, US-government-gated rollout, and GPT-5.5 has no composite in the source benchmark dataset yet. A model is added here only once an independent composite score exists for it, so the ranking is not padded with vendor-reported numbers.

Sources

  • Price Per Token. LLM API Pricing and Benchmarks dataset (composite Coding and Intelligence scores). Scores read 2026-06-27. https://pricepertoken.com/
  • Artificial Analysis. Independent LLM benchmarks and intelligence index (cited by the source dataset as a benchmark origin). https://artificialanalysis.ai/
  • Capital & Compute. AI model registry (verified per-token API prices, each linked to a provider source). /ai-models/

Machine-readable data: /ai-model-leaderboard.json.
