The Price Reversal Phenomenon: When Cheaper AI Costs More
A 2026 Microsoft Research preprint found the cheaper-per-token AI model cost more to finish the job in 32% of model pairs. Why the sticker misleads.
By Capital & Compute
A new study put a number on something every team scaling AI has felt in its invoice: the cheapest model on the pricing page is often not the cheapest model to run. Across eight frontier reasoning models and twelve task suites, in 32% of head-to-head matchups the model with the lower listed price ended up costing more to finish the same work. The worst gap reached 28x.
The cleanest example in the paper: Google’s Gemini 3 Flash carries a list price 80% below GPT-5.4, yet across all twelve tasks it cost 38% more to actually complete them. Cheaper per token, more expensive per job.
The study: eight models, twelve tasks, one uncomfortable finding
The source is a 2026 preprint, The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More, authored by Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, and James Zou, and hosted as a Microsoft Research publication. It is a preprint, not yet peer-reviewed, so read the exact figures as the authors’ own measurements rather than settled fact. The method is the part that matters, and it is simple: take eight frontier reasoning models, run each across the same large set of tasks, and compare what each model’s listed per-token price implied against what the work actually cost once every token was counted.
The eight models span the current frontier: GPT-5.4 and GPT-5.4 Mini, Gemini 3.1 Pro and Gemini 3 Flash, Claude Opus 4.7 and Claude Haiku 4.5, Kimi K2.6, and MiniMax M2.7. The twelve task suites mix single-turn reasoning (competition math like AIME and LiveMathBench, science QA like GPQA and Humanity’s Last Exam, code generation on LiveCodeBench, plus ARC-AGI, ArenaHard, MMLU-Pro, and SimpleQA) with three multi-turn agentic suites (Terminal-Bench 2.0, Cybench, and GAIA). That breadth is the point: the reversal is not a quirk of one weird benchmark.
Those eight listed prices are the ladder a buyer actually shops from, and they span 20x from cheapest ($1.50) to dearest ($30.00).
| Model | Listed price (blended, USD per million tokens) |
|---|---|
| Claude Opus 4.7 | $30.00 |
| GPT-5.4 | $17.50 |
| Gemini 3.1 Pro | $14.00 |
| Claude Haiku 4.5 | $6.00 |
| GPT-5.4 Mini | $5.25 |
| Kimi K2.6 | $4.95 |
| Gemini 3 Flash | $3.50 |
| MiniMax M2.7 | $1.50 |
The headline result: across all model pairs studied, 32% exhibit a price reversal, where the model with the lower listed price incurs the higher total cost. The largest reversal reached 28x. In other words, ranking models by their pricing page would have pointed you at the more expensive option in roughly one of every three comparisons.
The headline reversal: 80% cheaper on paper, 38% more in practice
Take the paper’s clearest pair. By listed price, Gemini 3 Flash ($3.50 per million tokens in the study’s blended figure) is 80% cheaper than GPT-5.4 ($17.50). Run the full twelve-suite gauntlet, though, and Gemini 3 Flash cost $705 against GPT-5.4’s $509. The “cheaper” model came in 38% higher on the bill that arrives at the end of the month.
| Model | Listed price (relative to GPT-5.4) | Actual cost to finish (relative to GPT-5.4) |
|---|---|---|
| Gemini 3 Flash | 0.20x | 1.38x |
| GPT-5.4 | 1.00x | 1.00x |
The gap is not rounding error. It is the difference between a budget you set from the pricing page and the number that actually clears.
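A minimal sketch of the reversal check, using only the two models whose figures are quoted above (blended list prices of $3.50 and $17.50 per million tokens, measured totals of $705 and $509). The helper names and dictionary layout are illustrative assumptions, not the paper's code.

```python
from itertools import combinations

# Figures quoted in the article for the paper's clearest pair.
listed_price = {"Gemini 3 Flash": 3.50, "GPT-5.4": 17.50}  # USD per 1M tokens (blended)
actual_cost = {"Gemini 3 Flash": 705.0, "GPT-5.4": 509.0}  # USD to finish all twelve suites


def is_price_reversal(a: str, b: str) -> bool:
    """True when the model with the lower list price has the higher total cost."""
    cheaper, dearer = sorted((a, b), key=listed_price.get)
    return actual_cost[cheaper] > actual_cost[dearer]


def reversal_rate(models: list[str]) -> float:
    """Share of unordered model pairs that show a price reversal."""
    pairs = list(combinations(models, 2))
    return sum(is_price_reversal(a, b) for a, b in pairs) / len(pairs)


discount = 1 - listed_price["Gemini 3 Flash"] / listed_price["GPT-5.4"]
overrun = actual_cost["Gemini 3 Flash"] / actual_cost["GPT-5.4"] - 1

print(f"list-price discount: {discount:.0%}")  # 80%
print(f"actual-cost overrun: {overrun:.1%}")   # 38.5%, reported in the article as 38%
print(is_price_reversal("Gemini 3 Flash", "GPT-5.4"))  # True
```

Given measured totals for all eight models, the same `reversal_rate` call would run over the 28 possible pairs; only this one pair's totals are quoted here.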
Why it happens: you pay per token, models spend them differently
The mechanism is not exotic. You do not pay per question. You pay per token, and reasoning models emit two kinds of output: the visible answer, and a much larger stream of hidden “thinking” tokens they generate while working toward it. Those thinking tokens are billed at the output rate, and they dominate the bill. So your real cost is sticker price multiplied by tokens consumed, and consumption is the variable the pricing page never shows.
How big is the variation? On the same query, the study found one model can use 900% more thinking tokens than another, and on agentic tasks take 10x more turns of environment interaction. A model can be a fifth of the price per token and still spend its way past a rival if it burns more than five times as many tokens to reach the same answer. The pricing page shows you the first number and hides the second.
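The arithmetic behind that claim can be sketched in a few lines. This is a back-of-envelope model that assumes one blended per-token price applies uniformly to every token a model consumes. The function names are illustrative, the token counts in the last two lines are hypothetical, and the implied token ratio is derived here from the article's figures, not reported by the study.

```python
def job_cost(price_per_million: float, tokens_consumed: float) -> float:
    """Cost in USD: sticker price multiplied by tokens actually consumed."""
    return price_per_million * tokens_consumed / 1_000_000


def breakeven_token_ratio(cheap_price: float, dear_price: float) -> float:
    """How many times more tokens the cheaper model may burn before it costs more."""
    return dear_price / cheap_price


def implied_token_ratio(cheap_price: float, dear_price: float,
                        cheap_total: float, dear_total: float) -> float:
    """Token-consumption ratio implied by two total bills under one blended price."""
    return (cheap_total / cheap_price) / (dear_total / dear_price)


# Blended list prices and measured totals quoted in the article.
flash_price, gpt_price = 3.50, 17.50
print(breakeven_token_ratio(flash_price, gpt_price))                    # 5.0
print(round(implied_token_ratio(flash_price, gpt_price, 705, 509), 1))  # 6.9

# Hypothetical token counts: the cheaper model consumes six times as many tokens.
print(job_cost(gpt_price, 2_000_000))     # 35.0
print(job_cost(flash_price, 12_000_000))  # 42.0
```

Under this simplification, a fifth of the price buys exactly five times the tokens, so any consumption ratio above 5x flips the ranking.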
This is the same dynamic that makes a low per-token rate a poor predictor of an agent’s real bill. It is why cost per task, not cost per token, is the only honest unit for an agentic workload, and why tools that cut token consumption can move the bill more than switching to a cheaper-listed model.
See the reversal yourself. Rank today’s models by what they cost to finish the same task, and the order rarely matches the sticker prices in the ladder above:
Cost-per-task calculator
Modeled estimate for the selected model (Claude Sonnet 4.6):
- Cache reads: $0.405 (28%)
- Fresh input: $0.450 (31%)
- Output: $0.600 (41%)
A flat $10/mo plan (GitHub Copilot Pro) pays for itself above about 6.9 tasks/month at this cost-per-task. The scenario modeled here is 66 tasks/month, well past that point. Below the break-even, pure usage billing is cheaper; above it, the subscription is.
| Model | Sticker (in/out) | Cost/task |
|---|---|---|
| DeepSeek V4 | $0.435/$0.87 | $0.105 |
| Gemini 3 Flash | $0.5/$3 | $0.262 |
| Claude Haiku 4.5 | $1/$5 | $0.485 |
| Kimi K2.7 Code | $0.95/$4 | $0.559 |
| GLM-5.2 | $1.4/$4.4 | $0.737 |
| Gemini 3.5 Flash | $1.5/$9 | $0.787 |
| Qwen3.7 Max | $2.5/$7.5 | $1.01 |
| Gemini 3.1 Pro | $2/$12 | $1.05 |
| Claude Sonnet 4.6 (selected) | $3/$15 | $1.46 |
| Claude Opus 4.8 | $5/$25 | $2.42 |
| GPT-5.5 | $5/$30 | $2.63 |
| Claude Fable 5 | $10/$50 | $4.85 |
Modeled from published per-token API rates and stated token assumptions, not a benchmark. Real cost varies with your codebase and how tightly you scope each request; the same task can swing by an order of magnitude between runs.
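The percentage split and break-even figure in the calculator snapshot follow from simple division. The short sketch below reproduces them from the three displayed cost components. The variable and function names are illustrative, and because the token assumptions behind each component are not shown here, only the dollar figures are used.

```python
# Per-task cost components shown in the calculator snapshot (USD).
components = {"cache_reads": 0.405, "fresh_input": 0.450, "output": 0.600}

cost_per_task = sum(components.values())
shares = {name: value / cost_per_task for name, value in components.items()}


def breakeven_tasks(flat_fee: float, per_task: float) -> float:
    """Tasks per month above which a flat plan beats pure usage billing."""
    return flat_fee / per_task


print(f"cost per task: ${cost_per_task:.3f}")  # $1.455
for name, share in shares.items():
    print(f"{name}: {share:.0%}")               # 28%, 31%, 41%
print(f"break-even: {breakeven_tasks(10, cost_per_task):.1f} tasks/month")  # 6.9
```

At the modeled 66 tasks a month, the same per-task figure puts usage billing at roughly $96 against the $10 flat plan.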
Your bill is not even stable
The reversal is the eye-catching finding. The quieter one is worse for anyone trying to forecast spend. The study found that repeated runs of the same query on the same model yielded thinking-token variation of up to 9.7x. Same prompt, same model, and the cost can land anywhere across nearly an order of magnitude depending on how long the model happened to reason that time.
So per-task cost is not a point. It is a distribution. Budgeting on the average understates what your heaviest tasks will do, and a run that thinks for ten times its usual length is not an outlier you can design away. It is a property of how reasoning models sample.
| Scenario (thinking tokens, relative to cheapest run) | Most expensive run | Cheapest run |
|---|---|---|
| Same query, same model | 9.7x | 1.0x |
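One way to budget against a distribution instead of a point is to track a high percentile alongside the mean. The sketch below uses a nearest-rank percentile on a hypothetical sample of per-run costs, constructed so that the heaviest run costs 9.7x the lightest. The sample values and function names are illustrative assumptions, not data from the study.

```python
import math
from statistics import mean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 1."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical per-run costs (USD) for one prompt on one model.
run_costs = [0.10, 0.12, 0.11, 0.15, 0.13, 0.10, 0.30, 0.14, 0.12, 0.97]

avg = mean(run_costs)
p95 = percentile(run_costs, 0.95)
spread = max(run_costs) / min(run_costs)

print(f"mean ${avg:.3f}, p95 ${p95:.2f}, spread {spread:.1f}x")
```

In this sample the 95th-percentile run costs more than four times the mean, which is the gap a budget built on averages would miss.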
The preprint itself is a moving target, which only sharpens the point. An earlier version (March 2026) reported reversals in 21.8% of pairs across nine tasks and used GPT-5.2 as the comparison; the current version (May 2026) reports 32% across twelve tasks against GPT-5.4. The headline rate climbed as the test set grew. When the measured frequency of “cheaper costs more” rises every time someone adds tasks and re-runs the numbers, treating any single per-token comparison as settled is the error.
The list price is what a model advertises. The bill is what it does.
What to measure instead: cost to finish the work
If unit price is the wrong number, the right one is the total cost to complete a representative set of your own tasks, thinking tokens and retries included. The public version of this metric already exists. Artificial Analysis, an independent benchmarking firm, publishes a “cost to run the Intelligence Index” figure: the actual dollar cost for each model to complete its full benchmark suite, not a per-token rate. It is the same idea the preprint formalizes, and even models of nearly identical measured intelligence cost very different amounts to run.
| Model | Artificial Analysis Intelligence Index | Cost to run the Index (USD) |
|---|---|---|
| Fable 5 | 60 | $6.2K |
| Opus 4.8 | 56 | $3.7K |
| GPT-5.5 | 55 | $2.9K |
| Opus 4.7 | 54 | $4.4K |
The practical move follows directly. Before you commit to a model, run a representative slice of your real tasks through each candidate, measure the total bill, and rank on that. Do not rank on the pricing page. And measure it on your own workload, not a public leaderboard: as the benchmark numbers themselves show, the same task suite produces different results depending on the harness around it, so the only cost that predicts your bill is the one measured on your tasks, through your harness.
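That procedure can be sketched as a small harness-side tally. Everything named here is a placeholder: `model_a` and `model_b`, their prices, and the logged token counts are hypothetical, and the record layout assumes your harness logs input and output tokens per task, with thinking tokens counted as output.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int  # includes hidden thinking tokens, billed at the output rate


# Hypothetical (input, output) prices in USD per 1M tokens.
PRICES = {"model_a": (0.50, 3.00), "model_b": (2.50, 15.00)}

# Hypothetical logs from running the same two tasks through each candidate.
LOGS = [
    Usage("model_a", 40_000, 900_000),
    Usage("model_a", 55_000, 1_200_000),
    Usage("model_b", 40_000, 120_000),
    Usage("model_b", 55_000, 150_000),
]


def total_cost_by_model(logs: list[Usage]) -> dict[str, float]:
    """Sum the measured bill per model across all logged tasks."""
    totals: dict[str, float] = defaultdict(float)
    for u in logs:
        price_in, price_out = PRICES[u.model]
        totals[u.model] += (u.input_tokens * price_in + u.output_tokens * price_out) / 1e6
    return dict(totals)


totals = total_cost_by_model(LOGS)
ranking = sorted(totals, key=totals.get)  # cheapest measured bill first
print(ranking, totals)
```

With these made-up logs, `model_b` ranks first despite a list price five times higher, because `model_a` emits roughly eight times the output tokens on the same tasks.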
If you build on top of LLMs
Two consequences follow for anyone selling a product built on model calls, and both are about margin rather than benchmarks.
First, your cost of goods sold is not the sticker price. It is sticker price multiplied by consumption, and consumption is model-specific, variable, and partly random. A model you picked because the rate looked low can carry a higher true COGS than the one you rejected. The only way to know is to measure the finished-work cost, the way the study does, on the tasks your product actually runs.
Second, if you charge customers a flat fee while paying metered rates for the models underneath, the variance works against you in the dark. Your blended cost looks fine because light users subsidize the average, while your heaviest users quietly cross into negative margin on a number you do not control and cannot fully predict. Either price on measured cost per task with explicit headroom for the 9.7x tail, or meter usage through to the customer so their behavior, not your spreadsheet, carries the variance. This is the same discipline the rest of the 2026 cost-per-task picture rewards: the operators who win on unit economics are the ones measuring the bill, not the brochure.
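A rough sketch of that margin check, with every number hypothetical: a $40 flat monthly fee, a $1.00 measured average cost per task, and four customers at different usage levels. It only illustrates how a healthy blended margin can hide a customer served at a loss.

```python
FLAT_FEE = 40.00      # hypothetical monthly price charged per customer (USD)
COST_PER_TASK = 1.00  # hypothetical measured average cost per task (USD)

# Hypothetical tasks per month for each customer.
tasks_per_customer = {"light_1": 5, "light_2": 8, "medium": 20, "heavy": 90}


def margin(tasks: int, fee: float = FLAT_FEE, cost: float = COST_PER_TASK) -> float:
    """Monthly gross margin on one customer under a flat fee."""
    return fee - tasks * cost


margins = {name: margin(n) for name, n in tasks_per_customer.items()}
blended = sum(margins.values())
underwater = [name for name, m in margins.items() if m < 0]

print(margins)     # per-customer margin
print(blended)     # total across all customers
print(underwater)  # customers served at a loss
```

Here the blended margin is positive while the heaviest customer is underwater, and a run-to-run spike in cost per task would only widen that gap.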
The bottom line
Per-token price is a marketing surface. It is real, it is published, and it is close to useless on its own for predicting what a workload costs, because it omits the one variable that drives the bill: how many tokens a model burns to finish the job, which is large, model-specific, and unstable run to run. A model can be 80% cheaper on the page and 38% more expensive in the account. Across roughly a third of frontier model pairs, the pricing page points the wrong way.
Treat the list price as the opening bid, not the cost. Measure cost to finish the work on your own tasks, plan against the expensive tail rather than the average, and you will price and provision off a number that actually clears. Everyone still picking models off the pricing page is, in about one case in three, optimizing for the wrong one.
Sources
- Chen, L., Zhang, C., He, Y., Stoica, I., Zaharia, M., & Zou, J. (2026). The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More. arXiv preprint (v2, May 2026). arxiv.org/abs/2603.23971
- Microsoft Research. (2026). The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More (publication page). microsoft.com/en-us/research/publication
- Artificial Analysis. (2026). Models: Intelligence, Performance & Price (cost to run the Intelligence Index). artificialanalysis.ai/models
- Artificial Analysis. (2026). Claude Fable 5 cost ~$6.2K to run the Intelligence Index [post]. x.com/ArtificialAnlys
- Artificial Analysis. (2026). MiniMax-M2.7 cost ~$175 to run the Intelligence Index [post]. x.com/ArtificialAnlys