The AI model value leaderboard
Most rankings tell you which model scores highest. This one also tells you which model is worth the money. Each LLM is rated two ways: by its independent benchmark score and by its value, the benchmark points it buys per dollar of tokens. The two rankings are not the same.
Which AI model is the best value right now?
On benchmark scores alone, the strongest LLM you can buy here is Claude Opus 4.8 (74.3 of 100 on the coding composite). But cost flips the ranking: MiniMax M3 delivers about 111.6 coding points per dollar of blended token price, roughly 15.1x the value of Claude Opus 4.8. The cheaper models, mostly from Chinese labs, win on value; the priciest US flagships win on raw capability. Which one is "best" depends entirely on whether you are buying ability or buying ability per dollar.
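As a rough check, the 15.1x figure can be reproduced from the coding scores and token prices listed in the coding table further down, using the 3:1 blended price described under "How to read this leaderboard". The symbols P (blended price in dollars per million tokens) and V (value in points per dollar) are shorthand introduced here for illustration, not notation from the source dataset.

```latex
\begin{align*}
P_{\text{MiniMax M3}} &= \frac{3(0.30) + 1.20}{4} = 0.525, & V_{\text{MiniMax M3}} &= \frac{58.6}{0.525} \approx 111.6 \\
P_{\text{Opus 4.8}} &= \frac{3(5.00) + 25.00}{4} = 10, & V_{\text{Opus 4.8}} &= \frac{74.3}{10} \approx 7.4 \\
\frac{111.6}{7.4} &\approx 15.1
\end{align*}
```

The ratio here is taken between the one-decimal values as published; carrying the unrounded values through gives a ratio closer to 15.0, so "roughly 15x" is the safest reading.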
The value ladder, in one chart
Coding points per dollar of blended token price, for the models you can buy today. The order differs sharply from the raw-score ranking: the top scorer, Claude Opus 4.8, lands last, and the cheapest capable models sit at the top because they score within range of the frontier at a fraction of the price.
| Model | Value (coding pts/$) |
|---|---|
| MiniMax M3 | 111.6 |
| Qwen3.7 Plus | 99.8 |
| Nemotron 3 Ultra | 36.5 |
| Kimi K2.7 Code | 35.5 |
| Devstral 2 | 34.8 |
| GLM-5.2 | 32.0 |
| Llama 4 Maverick | 31.8 |
| Grok 4.3 | 22.5 |
| Qwen3.7 Max | 17.6 |
| Gemini 3.1 Pro | 15.3 |
| Claude Opus 4.8 | 7.4 |
The same picture, on general intelligence
Not every model has a coding score, but every model has a broader intelligence composite, so this ladder covers all the buyable models, including DeepSeek, the GPT-5 tiers, Grok, Sonnet, and Haiku. The story holds: the cheap models lead on value, and the priciest flagships (GPT-5.2 Pro, Claude Opus 4.8, Grok 4) fall to the bottom, where a top score cannot outrun a high token price.
| Model | Value (intelligence pts/$) |
|---|---|
| MiniMax M3 | 84.6 |
| Qwen3.7 Plus | 69.6 |
| DeepSeek V4 | 57.4 |
| Nemotron 3 Ultra | 28.0 |
| Llama 4 Maverick | 27.9 |
| Kimi K2.7 Code | 24.5 |
| GLM-5.2 | 23.8 |
| Devstral 2 | 21.3 |
| Grok 4.3 | 15.9 |
| Qwen3.7 Max | 12.3 |
| Claude Haiku 4.5 | 11.9 |
| Gemini 3.1 Pro | 10.3 |
| Gemini 3.5 Flash | 10.3 |
| GPT-5.3 Codex | 9.2 |
| Gemini 3 Pro Preview | 7.4 |
| Claude Sonnet 4.6 | 5.7 |
| Claude Opus 4.8 | 5.6 |
| Grok 4 | 5.6 |
| GPT-5.2 | 5.4 |
| GPT-5.2 Pro | 0.7 |
Best LLM by coding, ranked by value
Composite of code generation, understanding, and problem-solving (0-100). Value is coding points per dollar of blended token price.
| # | Model | Provider | Coding | Input $/Mtok | Output $/Mtok | Value (pts/$) |
|---|---|---|---|---|---|---|
| 1 | MiniMax M3 | MiniMax | 58.6 | $0.30 | $1.20 | 111.6 (best value) |
| 2 | Qwen3.7 Plus | Alibaba | 55.9 | $0.32 | $1.28 | 99.8 |
| 3 | Nemotron 3 Ultra | Nvidia | 49.3 | $0.60 | $3.60 | 36.5 |
| 4 | Kimi K2.7 Code | Moonshot ✓ | 60.8 | $0.95 | $4.00 | 35.5 |
| 5 | Devstral 2 | Mistral | 31.3 | $0.90 | $0.90 | 34.8 |
| 6 | GLM-5.2 | Zhipu ✓ | 68.8 | $1.40 | $4.40 | 32.0 |
| 7 | Llama 4 Maverick | Meta | 16.3 | $0.35 | $1.00 | 31.8 |
| 8 | Grok 4.3 | xAI | 35.2 | $1.25 | $2.50 | 22.5 |
| 9 | Qwen3.7 Max | Alibaba ✓ | 66.0 | $2.50 | $7.50 | 17.6 |
| 10 | Gemini 3.1 Pro | Google ✓ | 68.8 | $2.00 | $12.00 | 15.3 |
| 11 | Claude Opus 4.8 | Anthropic ✓ | 74.3 | $5.00 | $25.00 | 7.4 |
| 12 | Claude Fable 5 (unavailable) | Anthropic ✓ | 76.5 | $10.00 | $50.00 | 3.8 |
Scores are 0-100 composites from the source dataset. A ✓ next to the provider marks a price reconciled with our verified registry against the provider source; the remaining prices are the author-direct or endpoint rates reported by the source dataset and have not yet been independently re-verified.
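The two sort orders for this table can be sketched in a few lines of Python. This is a minimal illustration, not the site's own code: the `Row` type, the `rank` function, and the `include_unavailable` flag are hypothetical names, and only five rows from the table are included to keep it short.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    coding: float        # coding composite, 0-100
    input_usd: float     # $ per million input tokens
    output_usd: float    # $ per million output tokens
    available: bool = True

    @property
    def value(self) -> float:
        # Coding points per dollar of 3:1 blended token price.
        return self.coding / ((3 * self.input_usd + self.output_usd) / 4)


# A subset of the rows in the table above.
ROWS = [
    Row("MiniMax M3", 58.6, 0.30, 1.20),
    Row("GLM-5.2", 68.8, 1.40, 4.40),
    Row("Llama 4 Maverick", 16.3, 0.35, 1.00),
    Row("Claude Opus 4.8", 74.3, 5.00, 25.00),
    Row("Claude Fable 5", 76.5, 10.00, 50.00, available=False),
]


def rank(rows, by="value", include_unavailable=True):
    """Return model names sorted high to low by 'value' or by raw 'coding' score."""
    pool = [r for r in rows if r.available or include_unavailable]
    key = (lambda r: r.value) if by == "value" else (lambda r: r.coding)
    return [r.model for r in sorted(pool, key=key, reverse=True)]


print(rank(ROWS, by="value"))
print(rank(ROWS, by="coding", include_unavailable=False))
```

Sorting by value puts MiniMax M3 first and Claude Fable 5 last; sorting the available models by raw score puts Claude Opus 4.8 first.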
How to read this leaderboard
Two grounded inputs, one derived number. The benchmark scores are composite Coding and Intelligence scores read from the Price Per Token dataset, which aggregates independent benchmarks and cites Artificial Analysis, the HuggingFace Open LLM Leaderboard, and LayerLens. They are composites on a 0-100 scale, not a single named test. The prices carry a check when they are reconciled with this site's own verified registry, the same numbers behind the model release tracker; the rest are the author-direct or endpoint rate reported by the source dataset, labeled per row and not yet independently re-verified. Each links to its provider source.
From those two, the leaderboard computes value: the benchmark score divided by a blended token price, where blended price weights input and output tokens 3 to 1, ((3 × input) + output) ÷ 4. The 3:1 mix mirrors how an agentic coding session actually bills: it reads far more context than it writes. Value is a cost-efficiency measure, not a verdict on quality. A model can top the value ranking and still be the wrong choice for work that needs the highest absolute score. It is the same lesson as the price reversal in per-task cost: the headline number and the number that matters are rarely the same.
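The calculation described above fits in two small functions. This is a sketch under the stated 3:1 weighting; the function names `blended_price` and `value_score` are illustrative, not taken from the site or the source dataset.

```python
def blended_price(input_per_mtok: float, output_per_mtok: float) -> float:
    """Blended $/Mtok, weighting input tokens 3:1 over output tokens."""
    return (3 * input_per_mtok + output_per_mtok) / 4


def value_score(score: float, input_per_mtok: float, output_per_mtok: float) -> float:
    """Benchmark points per dollar of blended token price."""
    return score / blended_price(input_per_mtok, output_per_mtok)


# Scores and prices taken from the coding table.
minimax = value_score(58.6, 0.30, 1.20)
opus = value_score(74.3, 5.00, 25.00)
print(round(minimax, 1), round(opus, 1))  # prints 111.6 7.4
```

A different workload mix would change the weights: a chat-style job that writes as much as it reads would weight output more heavily, which penalizes models with expensive output tokens more than the 3:1 blend does.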
The independence rule
This leaderboard sells no placement. Ranking is never for sale, and no model is promoted for payment. The entire point of a value table is to be a neutral referee of what each model actually costs to use; the day a vendor could pay to look cheaper, it would be worthless. The benchmark scores come from an independent third party; the prices come from primary provider sources.
What is not here, and why
A model is listed only once an independent composite score exists for it, so the table is not padded with vendor-reported numbers. That leaves a few notable models tracked but not yet ranked:
- GPT-5.5 and GPT-5.6 (Sol, Terra, Luna). OpenAI's newest flagships. GPT-5.6 was previewed June 26, 2026 in a limited, US-government-gated rollout, and GPT-5.5 has no composite in the source dataset yet, so neither is scored. GPT-5.2, GPT-5.2 Pro, and GPT-5.3 Codex represent OpenAI on the board for now.
- Cohere North Mini Code. Free on hosted endpoints and open-weight, so a per-token value score is undefined. It posts a 33.4 Artificial Analysis Coding Index; the real cost is self-hosted compute, not a token rate.
- Smaller and older variants. Models below roughly 10B parameters, superseded 2024-era releases (Claude 3.5, GPT-4 Turbo, o1), and narrowly tracked or unpriced entries are left off to keep the board to current, recognizable, buyable models.
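The Cohere North Mini Code case shows why a free model cannot be placed on a points-per-dollar ladder: the blended price is zero, so the division has no answer. One way to handle that in code, as a sketch only, is to return no value at all rather than infinity. The function name `value_or_none` is hypothetical, and the $0 input and output rates are an assumption drawn from the "free on hosted endpoints" description.

```python
from typing import Optional


def value_or_none(score: float, input_per_mtok: float, output_per_mtok: float) -> Optional[float]:
    """Points per blended dollar, or None when the token price is zero."""
    blended = (3 * input_per_mtok + output_per_mtok) / 4
    if blended <= 0:
        # Free endpoint: points per dollar is undefined, so do not rank it.
        return None
    return score / blended


# 33.4 is the Artificial Analysis Coding Index quoted above, used for illustration;
# it is not the composite score used elsewhere on the board.
print(value_or_none(33.4, 0.0, 0.0))   # prints None
print(round(value_or_none(31.3, 0.90, 0.90), 1))  # Devstral 2: prints 34.8
```

Rows that come back as `None` can then be listed separately instead of being sorted alongside priced models.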
For per-token rates and release dates across every model the site follows, see the AI model release tracker. To turn these rates into the cost of a real job, use the cost-per-task calculator or put two models head to head. To pay nothing at all, see which AI models are free to use and good enough to ship with.
Frequently asked questions
- What is the best LLM for coding in 2026?
- Among models you can actually buy, Claude Opus 4.8 posts the highest coding composite in this dataset, 74.3 of 100. (Claude Fable 5 scores higher at 76.5 but is suspended worldwide under a US export-control directive, so it is listed for reference only.) On a value basis, points bought per dollar, cheaper models such as MiniMax M3 lead instead, because they score within range of the frontier at a fraction of the token price.
- What is the best value AI model?
- On the coding composite, MiniMax M3 is the value leader at about 111.6 points per dollar of blended price, roughly 15.1 times Claude Opus 4.8. Other low-cost models cluster near the top too: Qwen3.7 Plus and Kimi K2.7 Code on coding, and DeepSeek V4 on the intelligence composite, where it has a score. Value rewards low price, so it favors capable cheap models over the most expensive flagships: GPT-5.2 Pro, at $21/$168 per Mtok, lands last on the intelligence value ladder despite a top-tier score.
- How is the value score calculated?
- Value equals the benchmark composite score divided by a blended token price. The blended price weights input and output tokens 3 to 1: (3 times input + output) divided by 4, in dollars per million tokens. The 3:1 mix reflects agentic coding, which reads far more context than it writes. A higher value means more measured ability per dollar; it is a cost-efficiency measure, not a quality ranking on its own.
- Where do the benchmark scores come from?
- The composite Coding and Intelligence scores are read from the Price Per Token dataset, which aggregates independent benchmarks and cites Artificial Analysis, the HuggingFace Open LLM Leaderboard, and LayerLens. They are composite scores on a 0-100 scale, not a single named test. Prices marked with a check are reconciled with this site's own verified registry against the provider source; the rest are the author-direct or endpoint rate reported by the source dataset, labeled per row and not yet independently re-verified. This leaderboard sells no placement: ranking is never for sale.
- Why are GPT-5.5 and GPT-5.6 not on the leaderboard?
- They are tracked on the model registry but not yet scored. GPT-5.6 (Sol, Terra, Luna) was previewed on June 26, 2026 in a limited, US-government-gated rollout, and GPT-5.5 has no composite in the source benchmark dataset yet. A model is added here only once an independent composite score exists for it, so the ranking is not padded with vendor-reported numbers.
Sources
- Price Per Token. LLM API Pricing and Benchmarks dataset (composite Coding and Intelligence scores). Scores read 2026-06-27. https://pricepertoken.com/
- Artificial Analysis. Independent LLM benchmarks and intelligence index (cited by the source dataset as a benchmark origin). https://artificialanalysis.ai/
- Capital & Compute. AI model registry (verified per-token API prices, each linked to a provider source). /ai-models/
Machine-readable data: /ai-model-leaderboard.json.