Capital & Compute
Updated June 21, 2026 · ai · coding-agents · benchmarks · pricing · economics

Claude Code vs Codex (2026): Cost and Capability

Codex leads Claude Code 83.4% to 78.9% on Terminal-Bench 2.1, but the prices match and the cheaper agent flips with your harness. The honest scorecard.

By Capital & Compute

The live leaderboard makes this look settled. On Terminal-Bench 2.1, the independent CLI-agent benchmark at tbench.ai, Codex running GPT-5.5 sits at number one with 83.4%, and Claude Code running Opus 4.8 sits fourth at 78.9%. A four-and-a-half point gap, advantage Codex.

Read the second row and the story changes. Rank two, at 83.1%, is also Claude Code: the same harness running Claude Fable 5, statistically level with Codex once you count the error bars. That model went dark on June 12, 2026, when a US government export-control directive forced Anthropic to suspend it. So the gap at the top of the board is not only a fact about capability. It is partly a fact about which models you are allowed to run this week.

That is the trap in any “Claude Code vs Codex” verdict in 2026. The two coding agents are close enough that the answer turns on details most comparisons skip: which model is live, which harness you measure inside, and how many tokens each burns to finish the same task. This is the honest scorecard, on the only two axes that decide it: capability and cost.

| | Claude Code | Codex |
| --- | --- | --- |
| CLI and live model | Anthropic CLI · Opus 4.8 | OpenAI CLI · GPT-5.5 |
| Terminal-Bench 2.1 | 78.9% | 83.4% |
| SWE-bench Pro | 69.2% | 58.6% |
| Entry plan | $20/mo | $20/mo |
| API output / 1M | $25 | $30 |
| Top model status | Fable 5 offline | GPT-5.5 live |

Capability: a converged fingerprint, not a winner

Hold the two side by side across the benchmarks that matter for coding and reasoning and the picture is not a ladder. It is two shapes that bulge in different places. Codex pushes out on the terminal-agent and abstract-reasoning axes. Claude Code pushes out on hard codebase resolution. On the academic-reasoning and general-intelligence axes they sit almost on top of each other.

Figure: Claude Code (Opus 4.8) vs Codex (GPT-5.5) capability fingerprint. Radar chart over six axes, each normalized to 100.

| Axis | Claude Code (Opus 4.8) | Codex (GPT-5.5) |
| --- | --- | --- |
| Terminal CLI | 78.9 | 83.4 |
| SWE Verified | 88.6 | 88.7 |
| SWE Pro | 69.2 | 58.6 |
| GPQA | 93.6 | 93.6 |
| ARC-AGI-2 | 72.1 | 85.0 |
| AA Index | 61.4 | 60.2 |

Capability fingerprint, each axis normalized to 100. Codex (GPT-5.5) bulges on Terminal-Bench and ARC-AGI-2; Claude Code (Opus 4.8) bulges on SWE-bench Pro; the two are level on SWE-bench Verified, GPQA Diamond, and the Artificial Analysis Intelligence Index. Terminal-Bench 2.1 and the AA Index are independent; SWE-bench Verified, SWE-bench Pro, and GPQA Diamond are vendor-reported; ARC-AGI-2 is from the ARC Prize leaderboard. Shapes this similar are the whole point: the models have converged.
Source: tbench.ai, Artificial Analysis, ARC Prize, and vendor model cards, June 2026

A caveat the chart cannot show, and that the citations demand: those axes are not all measured the same way. Terminal-Bench 2.1 and the Artificial Analysis Intelligence Index are independent third-party measurements. SWE-bench Verified, SWE-bench Pro, and GPQA Diamond are vendor-reported figures from each lab’s own model card, and independent re-counts run lower: vals.ai, which re-runs SWE-bench Verified itself, scores the field in the low 80s rather than the high 80s, with the two models still within a point of each other. ARC-AGI-2 is from the ARC Prize leaderboard. State that openly, because a fingerprint built from mixed sources is only as honest as its labels.

What the shape says, read with those caveats, is that the frontier has converged. That matches what an earlier look at the 2026 benchmark landscape found: when models cluster this tightly, the harness around a model moves the result as much as the weights inside it. Which is exactly why the next chart matters more than this one.

Round by round: who actually wins each test

The fingerprint shows overlap. The scorecard shows the deltas, round by round, with the winner of each bolded. This is the chart to read if you want a verdict per task type rather than a vibe.

Figure: Benchmark scorecard, Claude Code (Opus 4.8) vs Codex (GPT-5.5). Diverging bar chart, six benchmark rounds.

| Metric | Source | Claude Code (Opus 4.8) | Codex (GPT-5.5) | Leader |
| --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 | independent | 78.9 | 83.4 | Codex |
| SWE-bench Verified | vendor-reported | 88.6 | 88.7 | tie |
| SWE-bench Pro | vendor-reported | 69.2 | 58.6 | Claude Code |
| GPQA Diamond | vendor-reported | 93.6 | 93.6 | tie |
| ARC-AGI-2 | ARC Prize | 72.1 | 85.0 | Codex |
| AA Intelligence Index | independent | 61.4 | 60.2 | Claude Code |

Head-to-head by benchmark, both sides on a 0-100 scale, the leading bar drawn solid. Codex wins the terminal and abstract-reasoning rounds; Claude Code wins hard codebase resolution; they split or tie the rest. No round is a blowout, which is the point. Independent: Terminal-Bench 2.1 (tbench.ai), AA Index (Artificial Analysis). Vendor-reported: SWE-bench Verified, SWE-bench Pro, GPQA Diamond. ARC Prize: ARC-AGI-2.
Source: tbench.ai, Artificial Analysis, ARC Prize, and vendor model cards, June 2026

The split is legible. Codex takes the rounds about driving a terminal and reasoning over unfamiliar abstractions: a 4.5-point lead on Terminal-Bench 2.1, a lead that held on the previous 2.0 version too, and a 12.9-point lead on ARC-AGI-2. Claude Code takes the round that arguably looks most like real maintenance work: SWE-bench Pro, where resolving issues inside a large existing codebase rewards the more deliberate planner, by 10.6 points. On SWE-bench Verified, GPQA Diamond, and the Artificial Analysis Intelligence Index, the difference is inside the noise.
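The per-round reading above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the site's method: the scores and source labels are copied from the scorecard, while `NOISE_BAND` (1.5 points) is a hypothetical threshold chosen only so that the three near-level rounds count as ties, as the paragraph above treats them. No error bars are published here, so pick your own band.

```python
# Scores copied from the scorecard: (Claude Code on Opus 4.8, Codex on GPT-5.5, source label).
SCORES = {
    "Terminal-Bench 2.1": (78.9, 83.4, "independent"),
    "SWE-bench Verified": (88.6, 88.7, "vendor-reported"),
    "SWE-bench Pro": (69.2, 58.6, "vendor-reported"),
    "GPQA Diamond": (93.6, 93.6, "vendor-reported"),
    "ARC-AGI-2": (72.1, 85.0, "ARC Prize"),
    "AA Intelligence Index": (61.4, 60.2, "independent"),
}

# ASSUMPTION: a hypothetical noise band in points; the article does not publish one.
NOISE_BAND = 1.5


def round_result(claude, codex, band=NOISE_BAND):
    """Return (leader, delta), where delta is Codex minus Claude Code in points."""
    delta = round(codex - claude, 1)
    if abs(delta) <= band:
        return "tie", delta
    return ("Codex" if delta > 0 else "Claude Code"), delta


# Leader and delta for every round.
RESULTS = {
    name: round_result(claude, codex)
    for name, (claude, codex, _source) in SCORES.items()
}

# The same reading restricted to independently measured rounds.
INDEPENDENT = {
    name: RESULTS[name]
    for name, (_claude, _codex, source) in SCORES.items()
    if source == "independent"
}

if __name__ == "__main__":
    for name, (leader, delta) in RESULTS.items():
        print(f"{name:<24} {delta:+6.1f}  {leader}")
```

With that band, the 1.2-point Claude Code edge on the AA Intelligence Index is scored as a tie. Restricting to independent measurements leaves only two rounds, Terminal-Bench 2.1 and the AA Intelligence Index, which is a reminder of how much of the fingerprint rests on vendor-reported figures.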

One number worth stating in plain terms, because Anthropic states it differently. Anthropic’s own model card reports Opus 4.8 on Terminal-Bench 2.1 at 82.7%, which would narrow the gap to Codex from 4.5 points to 0.7. The 78.9% used here is the independent tbench.ai leaderboard figure, run in a standardized harness. The rule this site holds to: a vendor’s self-reported number is a vendor’s self-reported number, and the independent run is the one that goes on the scorecard.

The asterisk on the leaderboard: a model pulled by the state

Here is the fact that reorders everything above. The model that would put Claude Code level with Codex at the top of Terminal-Bench, Claude Fable 5 at 83.1%, is not available to run.

On June 12, 2026, Anthropic published a statement that the US government, citing national security authorities, had issued an export-control directive to suspend all access to Fable 5 and its larger sibling Mythos 5. The stated concern was a method of bypassing safeguards meant to limit the model’s use for certain cybersecurity tasks, including identifying software vulnerabilities. As Al Jazeera reported, Anthropic did not gate access by nationality; it disabled the two models for every customer. It is the first time US export controls have been applied to an AI model itself rather than to the chips that train it.

This is the kind of variable a benchmark table is not built to hold, and it is now a real input to a tooling decision. Model availability has become a function of policy, not just uptime.

Cost: the prices match, the bill does not

Start with what is genuinely identical. The subscription ladders are the same to the dollar: OpenAI’s Codex plans run $20 a month for Plus and scale to $100 and $200 for the Pro tiers, and Anthropic’s Claude plans run $20 for Pro, $100 for Max 5x, and $200 for Max 20x. If you buy access by subscription, there is no price difference to discuss.

The list prices on the metered API nearly match too. Opus 4.8 lists at $5 per million input tokens and $25 per million output, unchanged from Opus 4.7. GPT-5.5, which OpenAI launched on April 23, 2026, lists at $5 input and $30 output, after OpenAI doubled the GPT-5 line’s rates (input from $2.50, output from $15). On paper Claude is a touch cheaper per output token.

  • $20 entry plan, both: identical $20 / $100 / $200 ladder
  • $5 / $25 per 1M tokens, Opus 4.8: input / output, API list
  • $5 / $30 per 1M tokens, GPT-5.5: input / output, API list
  • 3-4x tokens per task, Claude Code vs Codex, as reported

List price is the opening bid, not the bill. The number that clears is the cost to finish a representative task, and that depends on how many tokens the agent burns getting there, which is a property of the harness as much as the model. Here the evidence pulls in two directions, and an honest answer has to hold both.

The one independent, like-for-like measurement points Claude’s way. On Artificial Analysis’s Coding Agent Index, which runs each model inside the harness it ships in, the May 2026 run had Opus 4.7 in Claude Code finishing a task for $4.10 against GPT-5.5 in Codex at $4.82, with Claude scoring a point higher (66 to 65). On that test, Claude Code was both better and cheaper per task. No independent re-run on Opus 4.8 had been published as of this writing.

The community measurements point the other way. Comparison write-ups that log token usage on real tasks, such as Morph’s Codex-versus-Claude-Code teardown, report Claude Code consuming roughly three to four times the tokens of Codex on identical jobs, because its agent loop tends to take more turns. At a 3x to 4x token multiple, Claude’s slightly lower per-token rate is erased and then some, and Codex becomes the cheaper finisher. These are secondary, self-reported measurements, not a controlled benchmark, so weight them accordingly.
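The arithmetic behind "erased and then some" can be made explicit. The sketch below uses only the list prices quoted earlier and makes two loud assumptions that no source here confirms: both agents spend the same share of their tokens on input versus output, and no caching or batch discounts apply. The function names are illustrative.

```python
# API list prices in USD per 1M tokens: (input, output), as quoted in the article.
CLAUDE_RATES = (5.00, 25.00)  # Opus 4.8
CODEX_RATES = (5.00, 30.00)   # GPT-5.5


def blended_rate(rates, input_share):
    """USD per 1M tokens for a workload whose tokens are `input_share` input."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    in_rate, out_rate = rates
    return input_share * in_rate + (1.0 - input_share) * out_rate


def break_even_multiple(input_share):
    """Token multiple at which Claude Code's bill equals Codex's.

    ASSUMPTION: both agents have the same input/output mix and pay list price.
    """
    return blended_rate(CODEX_RATES, input_share) / blended_rate(CLAUDE_RATES, input_share)


def bill_ratio(token_multiple, input_share):
    """Claude Code bill divided by Codex bill at a given token multiple."""
    return token_multiple / break_even_multiple(input_share)


if __name__ == "__main__":
    for share in (0.0, 0.5, 0.9):
        print(
            f"input share {share:.1f}: break-even {break_even_multiple(share):.2f}x, "
            f"bill ratio at 3x {bill_ratio(3, share):.2f}, at 4x {bill_ratio(4, share):.2f}"
        )
```

Under those assumptions the break-even multiple never exceeds 1.2x, the value for an all-output workload (30 / 25), and it falls toward 1.0x as input tokens dominate. A reported 3x to 4x multiple is far past that line: at a 90% input share, a 3x multiple works out to roughly a 2.8x larger bill.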

The reconciliation is that they measure different things. Artificial Analysis fixes the harness and the task suite; the community logs measure whatever workload the author ran, in whatever effort mode, with whatever guardrails. This is the same headline-to-bill drift a 2026 Microsoft Research preprint measured across frontier models, where the cheaper-listed model finished the work at a higher cost in roughly a third of matchups. The discipline it forces is the one this site keeps returning to: the only cost figure worth trusting is cost to finish a representative slice of your own tasks, measured in your own harness.

The verdict: pick the harness, not the headline

There is no clean winner, and pretending otherwise would be the dishonest move. The capability fingerprints overlap, the subscription prices are identical, and the per-task cost flips depending on whose measurement you trust and which workload you run.

So decide on fit, not on a single number:

  • Choose Codex if your work lives in the terminal and rewards token thrift: CLI-heavy automation, abstract problem-solving, and high-volume jobs where fewer turns per task compound into real savings. It holds the live Terminal-Bench lead and the leaner token profile.
  • Choose Claude Code if your work is resolving issues inside large existing codebases, where its edge on SWE-bench Pro and its more deliberate planning loop pay off, and where the independent cost-per-task run currently favors it.
  • Watch the availability axis, which is new. The strongest Claude Code configuration is offline by government order, and “which models can I legally run” is now part of a serious tooling decision, not a footnote.

Run both on a representative slice of your actual work for a week, log the tokens and the cost, and let your codebase break the tie. On numbers this close, your workload is the only benchmark that counts.
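One way to turn that week of logs into a comparable figure is cost per finished task, where failed runs still count toward spend. The sketch below is a hypothetical log format, not output from either tool: the `Run` fields, the agent labels, and every number in `SAMPLE_RUNS` are invented for illustration, and the rates are the API list prices quoted earlier, so substitute your own logged data and whatever you are actually billed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One logged agent run. Field names are hypothetical, not a tool's log schema."""
    agent: str
    input_tokens: int
    output_tokens: int
    finished: bool  # did the run complete the task to your own acceptance bar?


# USD per 1M tokens (input, output): API list prices quoted in the article.
RATES = {"claude-code": (5.00, 25.00), "codex": (5.00, 30.00)}

# ASSUMPTION: invented sample data, for illustration only.
SAMPLE_RUNS = [
    Run("claude-code", 600_000, 40_000, True),
    Run("claude-code", 200_000, 20_000, False),
    Run("codex", 300_000, 50_000, True),
    Run("codex", 100_000, 50_000, True),
]


def cost_per_finished_task(runs, rates=RATES):
    """Total spend per agent divided by the number of tasks it finished."""
    spend = {}
    done = {}
    for run in runs:
        in_rate, out_rate = rates[run.agent]
        cost = (run.input_tokens * in_rate + run.output_tokens * out_rate) / 1_000_000
        spend[run.agent] = spend.get(run.agent, 0.0) + cost
        done[run.agent] = done.get(run.agent, 0) + int(run.finished)
    # An agent that finished nothing has no finite cost per finished task.
    return {
        agent: (spend[agent] / done[agent] if done[agent] else float("inf"))
        for agent in spend
    }


if __name__ == "__main__":
    for agent, cost in cost_per_finished_task(SAMPLE_RUNS).items():
        print(f"{agent}: ${cost:.2f} per finished task")
```

Dividing by finished tasks rather than by runs is the design choice that matters: an agent that is cheap per attempt but fails often can still be the expensive one to ship with.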

Frequently asked questions

Is Codex or Claude Code better in 2026?
Capability is converged, not separated. On the independent Terminal-Bench 2.1 leaderboard, Codex on GPT-5.5 leads at 83.4% and Claude Code on Opus 4.8 trails at 78.9%, while Claude Code on Fable 5, suspended since June 12, 2026, scored 83.1%, statistically level with Codex. They split the benchmarks: Codex is stronger on terminal and abstract-reasoning tasks, Claude Code on hard codebase resolution (SWE-bench Pro).
Is Codex or Claude Code cheaper?
The subscription ladders are identical at $20, $100, and $200 a month, and API list prices nearly match ($5/$25 per million tokens for Opus 4.8, $5/$30 for GPT-5.5). Per-task cost flips by harness and workload: one independent run put Claude Code cheaper ($4.10 vs $4.82), while community measurements report Claude Code burning three to four times the tokens, which would make Codex the cheaper finisher.
How much do Claude Code and Codex cost per month?
Both follow the same subscription ladder. Codex runs $20 a month for ChatGPT Plus and scales to $100 and $200 for the Pro tiers; Claude runs $20 for Pro, $100 for Max 5x, and $200 for Max 20x. If you buy access by subscription, there is no price difference.
Why was Claude Fable 5 pulled offline?
A US government export-control directive forced Anthropic to suspend Fable 5 on June 12, 2026. Fable 5 was Claude Code's co-leader at 83.1% on Terminal-Bench 2.1, so Opus 4.8 is Claude Code's live ceiling against Codex right now.
Should I choose Codex or Claude Code?
Pick by fit, not a single number. Choose Codex for terminal-heavy, token-thrifty work where it holds the live Terminal-Bench lead. Choose Claude Code for resolving issues inside large codebases, where its SWE-bench Pro edge and the independent cost-per-task run currently favor it. Which models you can legally run is now part of the decision too.

Sources

Anthropic (2026). Statement on the US government directive to suspend access to Fable 5 and Mythos 5. Anthropic. https://www.anthropic.com/news/fable-mythos-access

Anthropic (2026). Pricing. Anthropic (vendor documentation). https://www.anthropic.com/pricing

Al Jazeera (2026, June 14). US asks Anthropic to block global access to top AI models: Why it matters. Al Jazeera (secondary coverage). https://www.aljazeera.com/news/2026/6/14/us-asks-anthropic-to-block-global-access-to-top-ai-models-why-it-matters

Artificial Analysis (2026). Coding Agent Index and Intelligence Index. Artificial Analysis (independent benchmarking). https://artificialanalysis.ai/

ARC Prize (2026). ARC-AGI Leaderboard. ARC Prize Foundation. https://arcprize.org/leaderboard

Morph (2026). Codex vs Claude Code. Morph (secondary comparison). https://www.morphllm.com/comparisons/codex-vs-claude-code

OpenAI (2026, April 23). Introducing GPT-5.5. OpenAI. https://openai.com/index/introducing-gpt-5-5/

OpenAI (2026). Codex pricing. OpenAI (vendor documentation). https://developers.openai.com/codex/pricing

Terminal-Bench (2026). Terminal-Bench 2.1 leaderboard. tbench.ai (independent benchmark). https://www.tbench.ai/leaderboard/terminal-bench/2.1

vals.ai (2026). SWE-bench Verified independent leaderboard. vals.ai (independent benchmarking). https://www.vals.ai/
