The Price Reversal Phenomenon: When Cheaper AI Costs More

A new study put a number on something every team scaling AI has felt in its invoice: the cheapest model on the pricing page is often not the cheapest model to run. Across eight frontier reasoning models and twelve task suites, in 32% of head-to-head matchups the model with the lower listed price ended up costing more to finish the same work. The worst gap reached 28x.

The cleanest example in the paper: Google’s Gemini 3 Flash carries a list price 80% below GPT-5.4, yet across all twelve tasks it cost 38% more to actually complete them. Cheaper per token, more expensive per job.

The study: eight models, twelve tasks, one uncomfortable finding

The source is a 2026 preprint, The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More, authored by Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, and James Zou, and hosted as a Microsoft Research publication. It is a preprint, not yet peer-reviewed, so read the exact figures as the authors’ own measurements rather than settled fact. The method is the part that matters, and it is simple: take eight frontier reasoning models, run each across the same large set of tasks, and compare what each model’s listed per-token price implied against what the work actually cost once every token was counted.

The eight models span the current frontier: GPT-5.4 and GPT-5.4 Mini, Gemini 3.1 Pro and Gemini 3 Flash, Claude Opus 4.7 and Claude Haiku 4.5, Kimi K2.6, and MiniMax M2.7. The twelve task suites mix single-turn reasoning (competition math like AIME and LiveMathBench, science QA like GPQA and Humanity’s Last Exam, code generation on LiveCodeBench, plus ARC-AGI, ArenaHard, MMLU-Pro, and SimpleQA) with three multi-turn agentic suites (Terminal-Bench 2.0, Cybench, and GAIA). That breadth is the point: the reversal is not a quirk of one weird benchmark.

Those eight listed prices are the ladder a buyer actually shops from, and they span 20x from cheapest ($1.50) to dearest ($30.00).

Blended listed price per million tokens, the eight study models
Model	Blended listed price (USD per million tokens)
Claude Opus 4.7	$30.00
GPT-5.4	$17.50
Gemini 3.1 Pro	$14.00
Claude Haiku 4.5	$6.00
GPT-5.4 Mini	$5.25
Kimi K2.6	$4.95
Gemini 3 Flash	$3.50
MiniMax M2.7	$1.50

The eight models' blended listed prices per million tokens: the ladder a buyer shops from before any work runs. Gemini 3 Flash (highlighted) sits near the bottom at $3.50, which is exactly why its higher cost to actually finish the work is the surprise. Listed price spans 20x across the set. Source: The Price Reversal Phenomenon (arXiv 2603.23971), Chen et al., 2026, Table 3

The headline result: across all model pairs studied, 32% exhibit a price reversal, where the model with the lower listed price incurs the higher total cost. The largest reversal reached 28x. In other words, ranking models by their pricing page would have pointed you at the more expensive option in roughly one of every three comparisons.
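As a rough illustration of how a reversal rate like this can be counted, the sketch below compares every pair of models and flags the pairs where the ordering by listed price disagrees with the ordering by actual cost. The function name `find_reversals` is hypothetical, the two `placeholder-*` entries and their numbers are invented for illustration, and only the GPT-5.4 and Gemini 3 Flash figures come from the article. The paper's own procedure may differ in detail.

```python
from itertools import combinations

def find_reversals(listed, actual):
    """Return (rate, pairs) where the cheaper-listed model cost more to run."""
    pairs = list(combinations(listed, 2))
    reversed_pairs = [
        (a, b)
        for a, b in pairs
        # Opposite signs mean the price ranking and the cost ranking disagree.
        if (listed[a] - listed[b]) * (actual[a] - actual[b]) < 0
    ]
    return len(reversed_pairs) / len(pairs), reversed_pairs

# Listed price in USD per million tokens; actual cost in USD for the whole workload.
# The placeholder rows are made-up values, not measurements from the study.
listed = {"GPT-5.4": 17.50, "Gemini 3 Flash": 3.50,
          "placeholder-A": 6.00, "placeholder-B": 1.50}
actual = {"GPT-5.4": 509.0, "Gemini 3 Flash": 705.0,
          "placeholder-A": 400.0, "placeholder-B": 120.0}

rate, reversed_pairs = find_reversals(listed, actual)
print(f"{len(reversed_pairs)} reversed pairs, rate {rate:.0%}")
for a, b in reversed_pairs:
    print(f"  {a} vs {b}")
```

Swapping in your own listed prices and measured totals gives the same kind of rate for the models you are actually choosing between.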

The headline reversal: 80% cheaper on paper, 38% more in practice

Take the paper’s clearest pair. By listed price, Gemini 3 Flash ($3.50 per million tokens in the study’s blended figure) is 80% cheaper than GPT-5.4 ($17.50). Run the full twelve-suite gauntlet, though, and Gemini 3 Flash cost $705 against GPT-5.4’s $509. The “cheaper” model came in 38% higher on the bill that arrives at the end of the month.

The reversal: listed price vs actual cost, relative to GPT-5.4
Model	Listed price	Actual cost to finish
Gemini 3 Flash	0.20x	1.38x
GPT-5.4	1.00x	1.00x

Indexed to GPT-5.4 = 1.0x on each axis. Gemini 3 Flash starts 80% below GPT-5.4 on listed price (0.20x) and ends 38% above it on the actual cost to finish the work (1.38x). The lines cross, and that crossing is the price reversal. Source: The Price Reversal Phenomenon (arXiv 2603.23971), Chen et al., 2026

The gap is not rounding error. It is the difference between a budget you set from the pricing page and the number that actually clears.
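The arithmetic behind that pair can be reproduced in a few lines. This is a sketch using only the four figures quoted above. The last value, `implied_consumption`, is an inference rather than a number from the paper: it assumes the blended listed price applies uniformly to every token, in which case the cost ratio divided by the price ratio approximates how many more billed tokens the cheaper model used.

```python
# Figures quoted in the article: blended listed price (USD per million tokens)
# and total cost (USD) to run all twelve suites.
listed = {"GPT-5.4": 17.50, "Gemini 3 Flash": 3.50}
actual = {"GPT-5.4": 509.0, "Gemini 3 Flash": 705.0}

price_ratio = listed["Gemini 3 Flash"] / listed["GPT-5.4"]  # listed price, indexed to GPT-5.4
cost_ratio = actual["Gemini 3 Flash"] / actual["GPT-5.4"]   # actual cost, indexed to GPT-5.4

# Assumption: one blended rate covers all tokens, so cost = price * tokens.
implied_consumption = cost_ratio / price_ratio

print(f"listed price: {price_ratio:.2f}x of GPT-5.4")
print(f"actual cost:  {cost_ratio:.3f}x of GPT-5.4")
print(f"implied token consumption: {implied_consumption:.1f}x")
```

Under that assumption, a model priced at a fifth of its rival has to burn well over five times the tokens before the bill flips, and here it did.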

Why it happens: you pay per token, models spend them differently

The mechanism is not exotic. You do not pay per question. You pay per token, and reasoning models emit two kinds of output: the visible answer, and a much larger stream of hidden “thinking” tokens they generate while working toward it. Those thinking tokens are billed at the output rate, and they dominate the bill. So your real cost is sticker price multiplied by tokens consumed, and consumption is the variable the pricing page never shows.

How big is the variation? On the same query, the study found one model can use 900% more thinking tokens than another, and on agentic tasks take 10x more turns of environment interaction. A model can be a fifth of the price per token and still spend its way past a rival if it burns more than five times the tokens to reach the same answer. The pricing page shows you the first number and hides the second.
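A minimal sketch of that billing mechanism, with every number invented for illustration: the function `job_cost` and both sets of prices and token counts are hypothetical, chosen so that the cheaper model costs a fifth as much per token but emits 900% more thinking tokens.

```python
def job_cost(input_tokens, answer_tokens, thinking_tokens, in_price, out_price):
    """Dollar cost of one job. Prices are USD per million tokens."""
    # Thinking tokens are billed at the output rate, alongside the visible answer.
    billed_output = answer_tokens + thinking_tokens
    return (input_tokens * in_price + billed_output * out_price) / 1_000_000

# Hypothetical pricier model: $5 in / $25 out, 4,000 thinking tokens.
pricey = job_cost(2_000, 500, 4_000, in_price=5.0, out_price=25.0)

# Hypothetical cheaper model: a fifth of the price, 40,000 thinking tokens.
cheap = job_cost(2_000, 500, 40_000, in_price=1.0, out_price=5.0)

print(f"pricier model: ${pricey:.4f} per job")
print(f"cheaper model: ${cheap:.4f} per job")
print("reversal" if cheap > pricey else "no reversal")
```

The input and visible-answer tokens are identical in both calls, so the whole difference comes from the thinking stream.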

32% of model pairs reverse: cheaper sticker, higher bill
28x worst-case reversal: gap between price rank and real cost
900% more thinking tokens: same query, one model vs another
10x more agent turns: same agentic task, one model vs another

This is the same dynamic that makes a low per-token rate a poor predictor of an agent’s real bill. It is why cost per task, not cost per token, is the only honest unit for an agentic workload, and why tools that cut token consumption can move the bill more than switching to a cheaper-listed model.

See the reversal yourself. Rank today’s models by what they cost to finish the same task, and the order rarely matches the sticker prices in the ladder above:

Cost-per-task calculator (modeled estimate)

Selected model: Claude Sonnet 4.6
Cost per task: $1.46
At 3 tasks/day over 22 working days: $96.03/mo

Where the money goes

Cache reads: $0.405 (28%)
Fresh input: $0.450 (31%)
Output: $0.600 (41%)

A flat $10/mo plan (GitHub Copilot Pro) pays for itself above about 6.9 tasks/month at this cost-per-task. You are modeling 66. Below the break-even, pure usage billing is cheaper; above it, the subscription is.
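The calculator's figures can be checked by hand. The sketch below uses only the numbers shown above (the three cost components, 3 tasks per day, 22 working days, and the $10 flat plan); the variable names are illustrative.

```python
# Per-task cost components in USD, as shown in the calculator output.
components = {"cache_reads": 0.405, "fresh_input": 0.450, "output": 0.600}

cost_per_task = sum(components.values())       # the $1.46 shown is this value rounded
shares = {name: value / cost_per_task for name, value in components.items()}

tasks_per_day = 3
working_days = 22
tasks_per_month = tasks_per_day * working_days
monthly_cost = cost_per_task * tasks_per_month

flat_plan = 10.00                               # USD per month
break_even_tasks = flat_plan / cost_per_task    # tasks/month where the plan pays off

print(f"cost per task: ${cost_per_task:.3f}")
for name, share in shares.items():
    print(f"  {name}: {share:.0%}")
print(f"monthly cost at {tasks_per_month} tasks: ${monthly_cost:.2f}")
print(f"flat plan breaks even at {break_even_tasks:.1f} tasks/month")
```

The monthly figure comes from the unrounded per-task cost of $1.455, which is why it is $96.03 rather than 66 times $1.46.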

Every model on this exact task, cheapest first

Model	Sticker (in/out)	Cost/task
DeepSeek V4	$0.435/$0.87	$0.105
Gemini 3 Flash	$0.5/$3	$0.262
Claude Haiku 4.5	$1/$5	$0.485
Kimi K2.7 Code	$0.95/$4	$0.559
GLM-5.2	$1.4/$4.4	$0.737
Gemini 3.5 Flash	$1.5/$9	$0.787
Qwen3.7 Max	$2.5/$7.5	$1.01
Gemini 3.1 Pro	$2/$12	$1.05
Claude Sonnet 4.6 (selected)	$3/$15	$1.46
Claude Opus 4.8	$5/$25	$2.42
GPT-5.5	$5/$30	$2.63
Claude Fable 5	$10/$50	$4.85

Modeled from published per-token API rates and stated token assumptions, not a benchmark. Real cost varies with your codebase and how tightly you scope each request; the same task can swing by an order of magnitude between runs. Open the full cost-per-task calculator to adjust every assumption.

Your bill is not even stable

The reversal is the eye-catching finding. The quieter one is worse for anyone trying to forecast spend. The study found that repeated runs of the same query on the same model yielded thinking-token variation of up to 9.7x. Same prompt, same model, and the cost can land anywhere across nearly an order of magnitude depending on how long the model happened to reason that time.

So per-task cost is not a point. It is a distribution. Budgeting on the average understates what your heaviest tasks will do, and a run that thinks for ten times its usual length is not an outlier you can design away. It is a property of how reasoning models sample.
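One way to treat cost as a distribution is to summarize repeated runs by a tail percentile as well as the mean. The sketch below is illustrative only: `cost_summary` is a hypothetical helper, and the ten per-run costs are invented so that the most expensive run is 9.7x the cheapest, echoing the spread the study reports.

```python
import statistics

def cost_summary(costs):
    """Summarize per-run costs (USD) by mean, 95th percentile, and spread."""
    ordered = sorted(costs)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=20, method="inclusive")[-1]
    return {
        "mean": statistics.fmean(ordered),
        "p95": p95,
        "max_over_min": ordered[-1] / ordered[0],
    }

# Hypothetical costs for ten runs of the same query on the same model.
runs = [0.10, 0.12, 0.11, 0.15, 0.13, 0.12, 0.40, 0.11, 0.14, 0.97]

summary = cost_summary(runs)
print(f"mean: ${summary['mean']:.3f}")
print(f"p95:  ${summary['p95']:.3f}")
print(f"most expensive run is {summary['max_over_min']:.1f}x the cheapest")
```

With a sample like this, a budget set on the mean covers the typical run and misses the heavy one, which is the case the tail percentile is there to catch.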

The same query, same model, run twice: the cost swings up to 9.7x
Item	Most expensive run	Cheapest run
Same query, same model	9.7x	1.0x

Indexed so the cheapest run = 1.0x. Repeated runs of the identical query on the same model varied in cost by up to 9.7x in the study, almost an order of magnitude, with nothing changed but the model's own sampling. Cost per task is a range, not a number. Source: The Price Reversal Phenomenon (arXiv 2603.23971), Chen et al., 2026

The preprint itself is a moving target, which only sharpens the point. An earlier version (March 2026) reported reversals in 21.8% of pairs across nine tasks and used GPT-5.2 as the comparison; the current version (May 2026) reports 32% across twelve tasks against GPT-5.4. The headline rate climbed as the test set grew. When the measured frequency of “cheaper costs more” rises every time someone adds tasks and re-runs the numbers, treating any single per-token comparison as settled is the error.

The list price is what a model advertises. The bill is what it does.

What to measure instead: cost to finish the work

If unit price is the wrong number, the right one is the total cost to complete a representative set of your own tasks, thinking tokens and retries included. The public version of this metric already exists. Artificial Analysis, an independent benchmarking firm, publishes a “cost to run the Intelligence Index” figure: the actual dollar cost for each model to complete its full benchmark suite, not a per-token rate. It is the same idea the preprint formalizes, and even models of nearly identical measured intelligence cost very different amounts to run.

Cost to run the Intelligence Index vs measured intelligence
Model	Artificial Analysis Intelligence Index	Cost to run the Index (USD)
Fable 5	60	$6.2K
Opus 4.8	56	$3.7K
GPT-5.5	55	$2.9K
Opus 4.7	54	$4.4K

Artificial Analysis's published cost to run its full Intelligence Index, plotted against each model's Intelligence Index score. Four frontier configurations of nearly identical measured intelligence (54-60) range from $2.9K to $6.2K to run the same suite, and the cost does not track the intelligence: Opus 4.7 costs more than the higher-scoring Opus 4.8. Cost to finish the work is its own axis, not a reading off the brochure. Source: Artificial Analysis, Intelligence Index and cost to run, 2026

The practical move follows directly. Before you commit to a model, run a representative slice of your real tasks through each candidate, measure the total bill, and rank on that. Do not rank on the pricing page. And measure it on your own workload, not a public leaderboard: as the benchmark numbers themselves show, the same task suite produces different results depending on the harness around it, so the only cost that predicts your bill is the one measured on your tasks, through your harness.
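A minimal sketch of that measurement, assuming your harness already logs token usage per run. Everything here is hypothetical: the record fields (`model`, `input_tokens`, `output_tokens`), the model names, the prices, and the token counts. The only requirement is that `output_tokens` includes thinking tokens, since those are billed at the output rate.

```python
from collections import defaultdict

def rank_by_measured_cost(runs, prices):
    """Total the logged cost per model and return (model, usd) pairs, cheapest first."""
    totals = defaultdict(float)
    for run in runs:
        in_price, out_price = prices[run["model"]]  # USD per million tokens
        totals[run["model"]] += (
            run["input_tokens"] * in_price + run["output_tokens"] * out_price
        ) / 1_000_000
    return sorted(totals.items(), key=lambda item: item[1])

# Hypothetical sticker prices (input, output) and logged runs of the same two tasks.
prices = {"model-x": (1.0, 5.0), "model-y": (5.0, 25.0)}
runs = [
    {"model": "model-x", "input_tokens": 3_000, "output_tokens": 60_000},
    {"model": "model-x", "input_tokens": 3_000, "output_tokens": 45_000},
    {"model": "model-y", "input_tokens": 3_000, "output_tokens": 5_000},
    {"model": "model-y", "input_tokens": 3_000, "output_tokens": 6_000},
]

ranking = rank_by_measured_cost(runs, prices)
for model, total in ranking:
    print(f"{model}: ${total:.3f}")
```

In this made-up log the model with the lower sticker price finishes second, which is the comparison to run on your own tasks before committing.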

If you build on top of LLMs

Two consequences follow for anyone selling a product built on model calls, and both are about margin rather than benchmarks.

First, your cost of goods sold is not the sticker price. It is sticker price multiplied by consumption, and consumption is model-specific, variable, and partly random. A model you picked because the rate looked low can carry a higher true COGS than the one you rejected. The only way to know is to measure the finished-work cost, the way the study does, on the tasks your product actually runs.

Second, if you charge a flat fee on top of metered usage, the variance works against you in the dark. Your blended cost looks fine because light users subsidize the average, while your heaviest users quietly cross into negative margin on a number you do not control and cannot fully predict. Either price on measured cost per task with explicit headroom for the 9.7x tail, or meter usage through to the customer so their behavior, not your spreadsheet, carries the variance. This is the same discipline the rest of the 2026 cost-per-task picture rewards: the operators who win on unit economics are the ones measuring the bill, not the brochure.
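To make the flat-fee problem concrete, here is a sketch with invented numbers: a $20 monthly fee, a measured cost of $0.25 per task, nine light users, and one heavy user. The function name and every figure are hypothetical.

```python
def monthly_margin(flat_fee, tasks, cost_per_task):
    """Margin in USD on one user for one month under a flat fee."""
    return flat_fee - tasks * cost_per_task

flat_fee = 20.00        # hypothetical flat monthly fee
cost_per_task = 0.25    # hypothetical measured cost per task

# Tasks per month for ten hypothetical users: nine light, one heavy.
usage = [10] * 9 + [120]

margins = [monthly_margin(flat_fee, tasks, cost_per_task) for tasks in usage]
blended = sum(margins) / len(margins)
break_even_tasks = flat_fee / cost_per_task

print(f"blended margin per user: ${blended:.2f}")
print(f"worst user margin:       ${min(margins):.2f}")
print(f"a user goes negative above {break_even_tasks:.0f} tasks/month")
```

The blended margin looks healthy while one account loses money every month, and a heavier cost tail per task only lowers the task count at which that happens.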

The bottom line

Per-token price is a marketing surface. It is real, it is published, and it is close to useless on its own for predicting what a workload costs, because it omits the one variable that drives the bill: how many tokens a model burns to finish the job, which is large, model-specific, and unstable run to run. A model can be 80% cheaper on the page and 38% more expensive in the account. Across roughly a third of frontier model pairs, the pricing page points the wrong way.

Treat the list price as the opening bid, not the cost. Measure cost to finish the work on your own tasks, plan against the expensive tail rather than the average, and you will price and provision off a number that actually clears. Everyone still picking models off the pricing page is, in about one case in three, optimizing for the wrong one.

Sources

Chen, L., Zhang, C., He, Y., Stoica, I., Zaharia, M., & Zou, J. (2026). The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More. arXiv preprint (v2, May 2026). arxiv.org/abs/2603.23971
Microsoft Research. (2026). The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More (publication page). microsoft.com/en-us/research/publication
Artificial Analysis. (2026). Models: Intelligence, Performance & Price (cost to run the Intelligence Index). artificialanalysis.ai/models
Artificial Analysis. (2026). Claude Fable 5 cost ~$6.2K to run the Intelligence Index [post]. x.com/ArtificialAnlys
Artificial Analysis. (2026). MiniMax-M2.7 cost ~$175 to run the Intelligence Index [post]. x.com/ArtificialAnlys