What are the main AI benchmarks in 2026?

For coding, SWE-bench Verified, DeepSWE, and Terminal-Bench. For reasoning, GPQA Diamond, ARC-AGI-2, and Humanity’s Last Exam. For math, FrontierMath and AIME. For broad knowledge, MMLU-Pro. For multimodal, MMMU. And for overall human preference, the LMArena Elo ranking. Many older benchmarks like MMLU, GSM8K, and HumanEval are now saturated and quoted mainly out of habit.

What does it mean when a benchmark is saturated?

A benchmark is saturated when the strongest models all score near its ceiling, so the differences between them are within noise and the benchmark no longer separates a better model from a worse one. MMLU, GSM8K, MATH, and HumanEval are all saturated in 2026, with top models scoring roughly 93% or higher, which is why the field keeps building harder replacements like MMLU-Pro and ARC-AGI-2.
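
To make "within noise" concrete, here is a minimal sketch that treats every benchmark task as an independent pass/fail trial and computes the binomial standard error of a score. The function names, the 500-task size and the 95% and 96% scores are illustrative assumptions, not figures from any specific leaderboard, and the independence assumption is a simplification (two models graded on the same tasks are better compared with a paired test).

```python
import math

def score_standard_error(accuracy: float, n_tasks: int) -> float:
    """Binomial standard error of an accuracy measured on n_tasks independent tasks."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)

def gap_is_within_noise(acc_a: float, acc_b: float, n_tasks: int, z: float = 1.96) -> bool:
    """True if the gap between two scores is smaller than z standard errors of the difference.

    Assumes both scores come from independent runs over the same number of tasks.
    """
    se_diff = math.sqrt(
        score_standard_error(acc_a, n_tasks) ** 2
        + score_standard_error(acc_b, n_tasks) ** 2
    )
    return abs(acc_a - acc_b) < z * se_diff

# Hypothetical example: two models at 95% and 96% on a 500-task benchmark.
print(f"standard error: {score_standard_error(0.95, 500) * 100:.2f} points")  # 0.97 points
print(gap_is_within_noise(0.95, 0.96, 500))  # True
```

Under these assumptions a one-point lead near the ceiling of a 500-task set is smaller than the measurement noise, which is the practical meaning of saturation.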

What is the difference between SWE-bench and SWE-bench Verified?

SWE-bench is the original 2,294-task set of real GitHub issues. SWE-bench Verified is a 500-task subset that OpenAI and the SWE-bench authors hand-checked in 2024 to remove broken tests and unsolvable issues, so it is the cleaner, more-quoted version. By 2026 even Verified is treated as saturated and contamination-prone, and OpenAI now recommends the harder SWE-bench Pro instead.

Which AI benchmark matters most for coding?

There is no single one, because they measure different things. SWE-bench Verified and DeepSWE measure whether an agent can resolve a real issue; Terminal-Bench measures whether it can operate a real terminal end to end; FrontierCode measures whether the code is clean enough to merge; LiveCodeBench measures contamination-free competitive programming. For shipping production code, read a completion benchmark and a quality benchmark together rather than trusting one number.

What is benchmark contamination?

Contamination is when a benchmark’s questions or answers leak into a model’s training data, so the model can recall the answer instead of reasoning it out. It inflates scores without reflecting real capability and is the main reason public, static benchmarks decay over time. The defenses are private or held-out test sets, time-stamped problems released after a model’s cutoff, and freshly generated tasks.
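
One of the defenses above, time-stamped problems released after a model's cutoff, can be sketched as a simple date filter. The `Problem` record, its field names and the dates are hypothetical; a real benchmark would carry its own schema and release metadata.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # date the problem first became public

def released_after_cutoff(problems, training_cutoff: date):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]

# Hypothetical problem set and cutoff date.
problems = [
    Problem("p1", date(2025, 3, 1)),
    Problem("p2", date(2025, 11, 15)),
    Problem("p3", date(2026, 2, 1)),
]
fresh = released_after_cutoff(problems, training_cutoff=date(2025, 6, 30))
print([p.problem_id for p in fresh])  # ['p2', 'p3']
```

The filter only helps if the release dates are accurate and the stated cutoff is honest; it cannot detect leaks that happen through later fine-tuning data.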

What is GPQA Diamond, and why is it hard?

GPQA Diamond is a 198-question set of graduate and PhD-level biology, physics, and chemistry questions written by domain experts to be Google-proof, meaning a non-expert with web access still cannot answer them quickly. It tests reasoning over retrieval. By 2026 top models exceed the roughly 70% human-expert baseline and sit in the low-to-mid 90s, so it is now largely saturated.

Directory · Updated June 29, 2026

AI benchmarks

Name: AI benchmark directory: what each major LLM benchmark measures
Creator: Capital & Compute
License: https://creativecommons.org/licenses/by/4.0/

Every model launch quotes a wall of benchmark names. This directory maps 55 of them across 9 categories, from the coding and agent tests builders watch most to reasoning, math, knowledge, long context, multimodal, human preference and safety: what each one measures, who built it, the year, how it is scored, a representative current top score, and a link to its leaderboard. For models ranked by value, see the value leaderboard; for why these scores are easier to trust some years than others, see our guide on whether AI benchmarks are reliable.

What are the main AI benchmarks?

The most-watched AI benchmarks in 2026, by what they test, are:

SWE-bench Verified and DeepSWE: real-world coding agents
Terminal-Bench: agentic command-line tasks
GPQA Diamond: PhD-level science reasoning
Humanity’s Last Exam: hardest cross-domain reasoning
ARC-AGI-2: novel abstract reasoning
FrontierMath and AIME: competition and research math
MMLU-Pro: broad academic knowledge
LMArena: human-preference ranking (Elo)
MMMU: college-level multimodal understanding

Older sets like MMLU, GSM8K and HumanEval are now saturated, with top models scoring roughly 93% or higher, so they are quoted mainly out of habit.

Benchmarks mapped: 55, across 9 capability categories
Coding & agent tests: the deepest category, and the one builders watch most
Now saturated: top models near the ceiling, so the score no longer separates them
Newest entries (2026): DeepSWE, FrontierCode, Terminal-Bench

What is an AI benchmark?

An AI benchmark is a standardized test, made of a fixed dataset, a task specification, and a scoring metric, used to measure and compare how well AI models perform a specific skill such as reasoning, coding, math, or knowledge. Running many models through the same test produces a single comparable number, which is what a model leaderboard ranks. The catch is that a benchmark only stays meaningful while it is hard: once the frontier clears it, or its answers leak into training data, the score stops telling good models from great ones, and the field has to build a harder one.
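
The three parts named above (a fixed dataset, a task specification, and a scoring metric) can be sketched as a tiny evaluation harness. Everything here is hypothetical: the two-item dataset, the exact-match scorer and the `toy_model` stand-in are illustrations, not any real benchmark's code.

```python
from typing import Callable

# Fixed dataset (hypothetical): pairs of (prompt, reference answer).
DATASET = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def exact_match(prediction: str, reference: str) -> bool:
    """Scoring metric: case-insensitive exact match after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()

def run_benchmark(model: Callable[[str], str]) -> float:
    """Task specification: send each prompt to the model, return accuracy in [0, 1]."""
    correct = sum(exact_match(model(prompt), ref) for prompt, ref in DATASET)
    return correct / len(DATASET)

def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; it only answers the arithmetic question."""
    return "4" if "2 + 2" in prompt else "I don't know"

print(run_benchmark(toy_model))  # 0.5
```

Because every model goes through the same `run_benchmark` call, the resulting numbers are directly comparable, which is what a leaderboard relies on.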

Benchmarks by category

The 55 benchmarks split into 9 categories. Coding and agents is the largest, both because it is the most commercially watched skill and because contamination forced builders to keep replacing the older sets. To search or filter all of them at once, use the directory tool below.

AI benchmarks tracked, by category
Category	Benchmarks tracked
Coding	15
Agents	8
Reasoning	6
Mathematics	6
Knowledge	4
Long context	5
Multimodal	4
Human preference	4
Safety	3

How the tracked benchmarks split across the nine categories. Coding and software-engineering agents is the deepest, reflecting both demand and the churn of contamination-driven replacements. Source: Capital & Compute AI benchmark directory, June 29, 2026

Coding & software-engineering agents

Whether a model can write, edit, and fix real code, increasingly as a multi-step agent working in a real repository. The most-watched category for AI builders, and where contamination bites hardest.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
Aider Polyglot	Percent correct after the second attempt, plus percent using the correct edit format	How well a model writes and correctly edits code across many languages, including applying diffs in the right format and self-correcting after test failures.	Aider · 2024	Active	See leaderboard
BigCodeBench	pass@1 against rigorous per-task test suites	Whether models can write code that correctly invokes multiple function calls from diverse real libraries to satisfy complex, practical instructions.	BigCode project · 2024	Active	See leaderboard
DeepSWE	pass@1 (committed code graded in a clean environment)	Whether frontier coding agents can complete original, long-horizon engineering tasks written from scratch, with no upstream PR to memorize.	Datacurve · 2026	Active	~70% · Claude Fable 5 (max)
FrontierCode	Pass rate on blocker criteria plus a weighted six-dimension quality rubric	Whether a coding agent produces a mergeable, production-quality pull request, not just one that passes tests, judged on correctness, regression safety, scope, tests and style.	Cognition · 2026	Active	13.4% (Diamond) · Claude Opus 4.8
HumanEval	pass@k (primarily pass@1)	Whether a model can synthesize a single correct Python function from a docstring so that it passes the provided unit tests.	OpenAI · 2021	Saturated	~99% · Frontier models broadly
LiveCodeBench	pass@1	Code generation and related skills (self-repair, execution, test-output prediction) on fresh competitive-programming problems, designed to be contamination-free.	UC Berkeley, MIT and Cornell · 2024	Active	See leaderboard
MBPP	pass@1	Whether a model can generate short, entry-level Python functions from a natural-language prompt that pass the provided tests.	Google Research · 2021	Saturated	~95%+ · Frontier models broadly
Multi-SWE-bench	% resolved (pass@1)	Cross-language issue resolution: whether agents can resolve real GitHub issues with a passing patch across many languages beyond Python.	ByteDance · 2025	Active	See leaderboard
RepoBench	Retrieval accuracy and exact-match / edit similarity for next-line completion	Repository-level code auto-completion: retrieving relevant cross-file context, predicting the next line, and the combined retrieval-plus-completion pipeline.	Liu, Xu and McAuley · 2023	Active	See leaderboard
SWE-bench	% resolved (pass@1)	Whether a system can resolve a real GitHub issue by generating a patch that passes the repository's hidden tests.	Princeton and Stanford · 2023	Saturated	See leaderboard
SWE-bench Multimodal	% resolved (pass@1)	Whether coding agents can resolve real GitHub issues in visual, user-facing JavaScript software where the bug or feature involves the UI.	Stanford and Princeton · 2024	Active	See leaderboard
SWE-bench Pro	% resolved (pass@1) under standardized agent scaffolding	Whether agents can solve long-horizon, enterprise-grade software-engineering tasks under standardized scaffolding, designed to resist contamination.	Scale AI · 2025	Active	59.1% (public set) · GPT-5.4 (xHigh)
SWE-bench Verified	% resolved (pass@1)	The same real-GitHub-issue resolution task as SWE-bench, restricted to a human-validated subset where the issue is solvable and the tests are not broken.	OpenAI · 2024	Saturated	~95% · Claude Fable 5
SWE-Lancer	Dollars earned (and % of tasks resolved)	Whether frontier models can complete real paid freelance software jobs, both coding and technical-management tasks, well enough to earn the payouts.	OpenAI · 2025	Active	See leaderboard
Terminal-Bench	Pass/fail, graded by verification scripts in the agent's Docker environment (pass@1)	Whether an AI agent can complete hard, realistic command-line tasks (build, configure, train, debug, secure) end to end inside a real terminal.	Stanford and the Laude Institute · 2026	Active	~82% · Codex (GPT-5.5)
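
Most rows above are scored with pass@1 or pass@k. As a hedged illustration, the sketch below uses the unbiased pass@k estimator introduced alongside HumanEval: draw n samples for a task, count the c that pass the tests, and estimate the chance that at least one of k randomly chosen samples passes. The function name and the example numbers are illustrative; individual benchmarks may compute their headline figure differently.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled solutions, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random size-k
    # subset of the n samples contains no passing solution.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 10 samples drawn, 3 of them pass.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```

With k = 1 the estimator reduces to the plain fraction of passing samples; the benchmark score is then the average of this value over all tasks.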

Agents, tool use & computer use

Whether a model can plan, call tools, browse, and operate a computer or website to finish open-ended tasks, not just answer a question in one shot.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
AgentBench	Per-environment success aggregated into an overall score	How well an LLM acts as an autonomous agent in multi-turn, open-ended decision-making across diverse interactive environments.	Tsinghua University · 2023	Active	See leaderboard
BrowseComp	Accuracy via model-graded semantic equivalence to the reference answer	Whether a browsing agent can persistently navigate the open web to locate a single hard-to-find, entangled fact.	OpenAI · 2025	Active	51.5% · OpenAI Deep Research (launch paper)
GAIA	Exact-match accuracy against an unambiguous answer	Whether an AI assistant can answer real-world questions that require multi-step reasoning, multiple modalities, web browsing and general tool use.	Meta AI and Hugging Face · 2023	Active	~75% · HAL agent (Claude Sonnet 4.5)
MLE-bench	Medal rate (fraction of competitions reaching bronze/silver/gold thresholds)	Whether an AI agent can do end-to-end machine-learning engineering (data prep, training, experimentation, submission) at the level of human Kaggle competitors.	OpenAI · 2024	Active	16.9% (paper baseline) · o1-preview with AIDE scaffolding
OSWorld	Execution-based success rate via per-task verification scripts that inspect machine state	Whether a multimodal agent can operate a real computer (desktop apps, file I/O, multi-app workflows) to complete open-ended tasks in a live virtual machine.	XLANG Lab, University of Hong Kong · 2024	Active	See leaderboard
tau-bench	pass^k: the probability an agent succeeds across all k independent trials (reliability, not just average success)	Whether a tool-using agent can reliably complete customer-service tasks over multi-turn conversations with a simulated user while obeying domain policies.	Sierra · 2024	Active	See leaderboard
VisualWebArena	Functional success rate via execution-based evaluation	Whether a multimodal agent can complete visually grounded web tasks that require interpreting images and page layout, not just text.	Carnegie Mellon University · 2024	Active	See leaderboard
WebArena	Functional success rate via execution-based reward checking the end state	Whether an autonomous agent can complete long-horizon, realistic web tasks (navigation, forms, multi-step workflows) in fully functional self-hosted websites.	Carnegie Mellon University · 2023	Active	See leaderboard
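
The pass^k metric listed for tau-bench rewards reliability: the agent must succeed on all k trials, not just one. A minimal sketch of one common way to estimate it, assuming n recorded trials of a task with c successes, is shown below. The function name and the 6-of-8 example are illustrative assumptions rather than the benchmark's own code.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that k trials drawn without replacement from n recorded trials all succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)

# Hypothetical agent that succeeds on 6 of 8 recorded trials of one task.
for k in (1, 2, 4):
    print(k, round(pass_hat_k(8, 6, k), 3))
# 1 0.75
# 2 0.536
# 4 0.214
```

The same agent that looks 75% successful on a single attempt completes four attempts in a row only about a fifth of the time, which is why a reliability metric can rank agents differently from average success.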

Reasoning & abstraction

Hard multi-step reasoning and fluid, novel problem-solving designed to resist memorization. This category holds the benchmarks the frontier is still furthest from solving, although BIG-Bench Hard and GPQA Diamond are now saturated.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
ARC-AGI-1	pass@2 exact-grid-match accuracy	Whether a system can infer the abstract rule of a novel visual grid puzzle from a few examples and apply it to a new input.	Francois Chollet · 2019	Active	87.5% (high compute) · OpenAI o3-preview
ARC-AGI-2	pass@2 exact-grid-match accuracy, reported with a cost-per-task efficiency metric	The same fluid-intelligence test as ARC-AGI-1, but with harder, contamination-resistant tasks that stay easy for humans yet very hard for AI.	ARC Prize Foundation · 2025	Active	54% (verified) · Poetiq (Gemini-based solver)
BIG-Bench Hard	Per-task accuracy averaged across the 23 tasks	A suite of multi-step reasoning tasks (logic, arithmetic, algorithmic, commonsense) on which pre-2022 models trailed average human raters.	Suzgun et al. · 2022	Saturated	See leaderboard
GPQA Diamond	Multiple-choice accuracy (random baseline 25%, PhD-expert baseline about 70%)	Graduate and PhD-level multiple-choice scientific reasoning in biology, physics and chemistry, on questions designed to be unanswerable by quick web search.	Rein et al. · 2023	Saturated	~94% · Gemini 3.1 Pro Preview
Humanity's Last Exam	Accuracy (exact match / multiple-choice), often reported with a calibration metric	Frontier, closed-ended expert knowledge and reasoning across more than 100 academic disciplines at the limit of human expertise.	Center for AI Safety (CAIS) and Scale AI · 2025	Active	53.3% · Claude Fable 5 (Max Effort)
MuSR	Multiple-choice accuracy	Multistep commonsense reasoning embedded in long natural-language narratives such as murder mysteries, object placement and team allocation.	Sprague, Ye, Durrett et al. · 2023	Active	See leaderboard

Mathematics

From grade-school word problems to research-level proofs. The older sets are saturated; the newest are held back from the public to stay contamination-resistant.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
AIME 2025	Exact-match accuracy, usually pass@1 averaged over samples	Olympiad-track competition mathematics at the level of the American Invitational Mathematics Examination, used as a high-difficulty LLM eval.	Mathematical Association of America; adopted as an LLM eval by the community · 2025	Saturated	100% · Multiple frontier reasoning models
FrontierMath	Accuracy (fraction with a correct, automatically verifiable final answer)	Research-level original mathematics requiring hours to days of expert effort, across number theory, analysis, algebraic geometry and more.	Epoch AI · 2024	Active	52.4% · GPT-5.5 Pro
GSM8K	Exact-match accuracy on the final numeric answer	Multi-step grade-school arithmetic word-problem reasoning.	OpenAI · 2021	Saturated	~99.6% · Frontier models broadly
MATH	Exact-match accuracy on the final boxed answer	Step-by-step solving of high-school competition mathematics across algebra, geometry, number theory, probability and precalculus.	Hendrycks et al. · 2021	Saturated	~99% (MATH-500) · GPT-5
MathArena	Per-competition accuracy and an aggregate expected-performance score	Mathematical reasoning and proof-writing on freshly released competition problems, evaluated before they can enter training data.	ETH Zurich · 2025	Active	81.1% (aggregate) · GPT-5.5 (xhigh)
Omni-MATH	Accuracy, scored with an LLM-based verifier (Omni-Judge)	Olympiad-level mathematical reasoning across a broad range of subdomains and difficulty levels.	Gao, Song, Cai et al. · 2024	Active	See leaderboard

Knowledge & general QA

Broad academic and factual knowledge across domains, usually multiple-choice. The most-quoted and most-saturated family, now largely replaced by harder variants.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
MMLU	Accuracy	Broad academic and professional knowledge across 57 subjects via four-choice multiple-choice questions.	Hendrycks et al. · 2021	Saturated	~93% · Qwen3.7 Max
MMLU-Pro	Accuracy	Harder multi-task reasoning and knowledge designed to de-saturate MMLU and reward deliberate reasoning over recall.	TIGER-Lab · 2024	Active	~90% · Gemini 3 Pro Preview
MMLU-Redux	Accuracy on cleaned labels	A re-annotated, error-corrected subset of MMLU used to measure true knowledge accuracy without the original's label noise.	Gema et al. · 2024	Active	See leaderboard
SimpleQA	Accuracy, plus correct-given-attempted and an F-score balancing attempts against accuracy	Short-form parametric factuality: whether a model answers single-answer fact-seeking questions correctly and abstains when unsure.	OpenAI · 2024	Active	See leaderboard
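
The three SimpleQA-style numbers can be illustrated with a short sketch. It assumes each answer is graded as correct, incorrect or not attempted, and that the F-score is the harmonic mean of overall accuracy and correct-given-attempted; that definition, the function name and the counts are assumptions made for illustration, so check the benchmark's own documentation before relying on them.

```python
def simpleqa_style_metrics(correct: int, incorrect: int, not_attempted: int) -> dict:
    """Overall accuracy, accuracy on attempted questions, and their harmonic mean."""
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    overall = correct / total if total else 0.0
    given_attempted = correct / attempted if attempted else 0.0
    denom = overall + given_attempted
    f_score = 2 * overall * given_attempted / denom if denom else 0.0
    return {
        "overall_correct": overall,
        "correct_given_attempted": given_attempted,
        "f_score": f_score,
    }

# Hypothetical model: 40 correct, 10 incorrect, 50 abstentions out of 100 questions.
metrics = simpleqa_style_metrics(correct=40, incorrect=10, not_attempted=50)
print({name: round(value, 3) for name, value in metrics.items()})
```

A model that abstains often scores well on correct-given-attempted but poorly on overall accuracy; combining the two penalizes both reckless guessing and excessive abstention.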

Long context & retrieval

Whether a model can actually use a very long input, not just accept it: finding facts, resolving references, and reasoning across hundreds of thousands of tokens.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
LongBench	v1: per-task automatic metrics. v2: multiple-choice accuracy	Comprehensive long-context understanding across realistic tasks (QA, summarization, few-shot, code, synthetic) in English and Chinese.	Tsinghua University · 2023	Active	57.7% (v2, with reasoning) · o1-preview
MRCR	Similarity of the model output to the target instance, gated by a required answer-prefix	Whether a model can distinguish and retrieve the correct one among multiple near-identical requests buried in a long multi-turn conversation.	Google DeepMind (Michelangelo); open-source variant by OpenAI · 2024	Active	See leaderboard
Needle-in-a-Haystack	Retrieval accuracy at each depth and length cell	Whether a model can recall a single planted fact (the needle) inserted at varying depths within a long context (the haystack).	Greg Kamradt · 2023	Saturated	See leaderboard
NoLiMa	Accuracy at each length, relative to the model's short-context baseline	Long-context retrieval and reasoning when the question and the target fact share minimal literal word overlap, forcing latent association rather than keyword matching.	Adobe Research and LMU Munich · 2025	Active	See leaderboard
RULER	Weighted-average accuracy across tasks and lengths; effective length is the longest length still above threshold	The real effective context length of a model by testing retrieval, multi-hop tracing, aggregation and QA at increasing sequence lengths.	NVIDIA · 2024	Active	See leaderboard
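
The "effective length" idea in the RULER row, the longest tested length whose score is still above a threshold, can be sketched in a few lines. The accuracy figures and the threshold value below are made up for illustration and are not taken from any published result.

```python
def effective_context_length(accuracy_by_length: dict[int, float], threshold: float) -> int:
    """Longest tested length whose accuracy is at or above the threshold (0 if none)."""
    passing = [length for length, acc in accuracy_by_length.items() if acc >= threshold]
    return max(passing, default=0)

# Hypothetical model: accuracy decays as the input grows.
scores = {4_000: 0.96, 8_000: 0.94, 16_000: 0.90, 32_000: 0.84, 64_000: 0.71}
print(effective_context_length(scores, threshold=0.85))  # 16000
```

In this made-up case a model that accepts 64,000 tokens would be credited with an effective length of only 16,000, which is the gap between accepting a long input and actually using it.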

Multimodal & vision

Reasoning over images, charts, documents, and video alongside text. The frontier for models that see, not just read.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
MathVista	Accuracy	Mathematical and quantitative reasoning grounded in visual contexts such as figures, charts, geometry and scientific diagrams.	Lu et al. · 2023	Active	~91% (testmini) · Seed 2.1 Pro
MMMU	Accuracy	College-level multimodal understanding and reasoning over images, diagrams, charts and text across many disciplines.	MMMU team · 2023	Active	~86% · Qwen3.6 Plus
MMMU-Pro	Accuracy	A harder, contamination-resistant version of MMMU that forces genuine visual reasoning rather than text-only shortcuts.	MMMU team · 2024	Active	~84% · Gemini 3.5 Flash
Video-MME	Accuracy (tested with and without subtitles)	Comprehensive video understanding by multimodal LLMs across short, medium and long clips.	MME-Benchmarks team · 2024	Active	~89% · Seed 2.1 Pro

Human preference & holistic

Aggregate and head-to-head measures: human-voted arenas, composite indices, and multi-metric frameworks that rank overall capability rather than one skill.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
Artificial Analysis Intelligence Index	Composite index score (0 to 100 aggregate)	A composite index of overall model intelligence aggregating performance across reasoning, coding, knowledge, science and agentic tasks.	Artificial Analysis · 2024	Active	~60 (index) · Claude Fable 5
HELM	Multi-metric (per-metric scores across scenarios; no single headline number)	Multi-metric holistic evaluation across many scenarios, reporting accuracy alongside calibration, robustness, fairness, bias, toxicity and efficiency.	Stanford CRFM · 2022	Active	See leaderboard
LMArena	Elo / Bradley-Terry pairwise rating (an Arena Score)	Crowdsourced human preference between two anonymized model responses, aggregated into a relative ranking, not an objective capability.	LMArena · 2023	Active	~1510 Elo · Claude Opus 4.8
MT-Bench	LLM-as-judge score (1 to 10 scale, averaged)	Instruction-following and conversational quality on multi-turn prompts, scored automatically by a strong LLM judge.	LMSYS · 2023	Saturated	See leaderboard
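
To show what an Elo-style rating gap means in practice, the sketch below uses the standard Elo expected-score formula (base 10, 400-point scale) to turn two ratings into a predicted win probability. The 1510 and 1480 ratings are illustrative, and an arena's actual fitting procedure (for example a Bradley-Terry fit over all votes) may differ from this textbook formula.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical pair of models 30 rating points apart.
print(round(elo_expected_score(1510, 1480), 3))  # 0.543
print(elo_expected_score(1500, 1500))            # 0.5
```

Under this model a 30-point lead corresponds to being preferred in roughly 54% of head-to-head votes, so small rating gaps describe a slight preference edge rather than a decisive capability difference.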

Safety, hallucination & factuality

Whether a model tells the truth and resists making things up. Measures honesty and hallucination rate, not raw capability.

Benchmark	How it is scored	What it measures	Maker · Year	Status	Top score · Model
HaluEval	Hallucination-recognition accuracy (faithful vs hallucinated)	A model's ability to recognize hallucinated content across question answering, knowledge-grounded dialogue and summarization.	Li et al. · 2023	Active	See leaderboard
TruthfulQA	% truthful (and % truthful-and-informative)	Whether a model avoids repeating common human misconceptions when answering questions, rather than imitating popular falsehoods.	Lin, Hilton, Evans · 2021	Active	See leaderboard
Vectara Hallucination Leaderboard	Hallucination rate (% of summaries judged unfaithful; lower is better)	How often a model introduces unsupported content when summarizing a provided source document, i.e. faithfulness in closed-book summarization.	Vectara · 2023	Active	1.8% (lower is better) · antgroup/finix-s1-32b

Search and filter all 55 benchmarks

Filter by category or status, or search by name, alias, what a benchmark measures, or who built it.

Benchmark and category	What it measures	Maker	Year	Status	Top score and model
Aider Polyglot Coding & software-engineering agents	How well a model writes and correctly edits code across many languages, including applying diffs in the right format and self-correcting after test failures.	Aider (Paul Gauthier)	2024	Active	See leaderboard
BigCodeBench Coding & software-engineering agents	Whether models can write code that correctly invokes multiple function calls from diverse real libraries to satisfy complex, practical instructions.	BigCode project (Zhuo et al.)	2024	Active	See leaderboard
DeepSWE Coding & software-engineering agents	Whether frontier coding agents can complete original, long-horizon engineering tasks written from scratch, with no upstream PR to memorize.	Datacurve	2026	Active	~70% Claude Fable 5 (max)
FrontierCode Coding & software-engineering agents	Whether a coding agent produces a mergeable, production-quality pull request, not just one that passes tests, judged on correctness, regression safety, scope, tests and style.	Cognition (with 20+ open-source maintainers)	2026	Active	13.4% (Diamond) Claude Opus 4.8
HumanEval Coding & software-engineering agents	Whether a model can synthesize a single correct Python function from a docstring so that it passes the provided unit tests.	OpenAI (Chen et al.)	2021	Saturated	~99% Frontier models broadly
LiveCodeBench Coding & software-engineering agents	Code generation and related skills (self-repair, execution, test-output prediction) on fresh competitive-programming problems, designed to be contamination-free.	UC Berkeley, MIT and Cornell (Jain, Han et al.)	2024	Active	See leaderboard
MBPP Coding & software-engineering agents	Whether a model can generate short, entry-level Python functions from a natural-language prompt that pass the provided tests.	Google Research (Austin, Odena et al.)	2021	Saturated	~95%+ Frontier models broadly
Multi-SWE-bench Coding & software-engineering agents	Cross-language issue resolution: whether agents can resolve real GitHub issues with a passing patch across many languages beyond Python.	ByteDance (ByteDance Seed)	2025	Active	See leaderboard
RepoBench Coding & software-engineering agents	Repository-level code auto-completion: retrieving relevant cross-file context, predicting the next line, and the combined retrieval-plus-completion pipeline.	Liu, Xu and McAuley (UC San Diego)	2023	Active	See leaderboard
SWE-bench Coding & software-engineering agents	Whether a system can resolve a real GitHub issue by generating a patch that passes the repository's hidden tests.	Princeton and Stanford (Jimenez, Yang, Yao et al.)	2023	Saturated	See leaderboard
SWE-bench Multimodal Coding & software-engineering agents	Whether coding agents can resolve real GitHub issues in visual, user-facing JavaScript software where the bug or feature involves the UI.	Stanford and Princeton (Yang, Jimenez et al.)	2024	Active	See leaderboard
SWE-bench Pro Coding & software-engineering agents	Whether agents can solve long-horizon, enterprise-grade software-engineering tasks under standardized scaffolding, designed to resist contamination.	Scale AI (Scale Labs)	2025	Active	59.1% (public set) GPT-5.4 (xHigh)
SWE-bench Verified Coding & software-engineering agents	The same real-GitHub-issue resolution task as SWE-bench, restricted to a human-validated subset where the issue is solvable and the tests are not broken.	OpenAI (with the SWE-bench authors)	2024	Saturated	~95% Claude Fable 5
SWE-Lancer Coding & software-engineering agents	Whether frontier models can complete real paid freelance software jobs, both coding and technical-management tasks, well enough to earn the payouts.	OpenAI (Miserendino, Patwardhan et al.)	2025	Active	See leaderboard
Terminal-Bench Coding & software-engineering agents	Whether an AI agent can complete hard, realistic command-line tasks (build, configure, train, debug, secure) end to end inside a real terminal.	Stanford and the Laude Institute	2026	Active	~82% Codex (GPT-5.5)
AgentBench Agents, tool use & computer use	How well an LLM acts as an autonomous agent in multi-turn, open-ended decision-making across diverse interactive environments.	Tsinghua University (THUDM; Liu et al.)	2023	Active	See leaderboard
BrowseComp Agents, tool use & computer use	Whether a browsing agent can persistently navigate the open web to locate a single hard-to-find, entangled fact.	OpenAI (Wei, Sun et al.)	2025	Active	51.5% OpenAI Deep Research (launch paper)
GAIA Agents, tool use & computer use	Whether an AI assistant can answer real-world questions that require multi-step reasoning, multiple modalities, web browsing and general tool use.	Meta AI and Hugging Face (Mialon, Fourrier et al.)	2023	Active	~75% HAL agent (Claude Sonnet 4.5)
MLE-bench Agents, tool use & computer use	Whether an AI agent can do end-to-end machine-learning engineering (data prep, training, experimentation, submission) at the level of human Kaggle competitors.	OpenAI (Chan et al.)	2024	Active	16.9% (paper baseline) o1-preview with AIDE scaffolding
OSWorld Agents, tool use & computer use	Whether a multimodal agent can operate a real computer (desktop apps, file I/O, multi-app workflows) to complete open-ended tasks in a live virtual machine.	XLANG Lab, University of Hong Kong (Xie et al.)	2024	Active	See leaderboard
tau-bench Agents, tool use & computer use	Whether a tool-using agent can reliably complete customer-service tasks over multi-turn conversations with a simulated user while obeying domain policies.	Sierra (Yao, Shinn, Narasimhan et al.)	2024	Active	See leaderboard
VisualWebArena Agents, tool use & computer use	Whether a multimodal agent can complete visually grounded web tasks that require interpreting images and page layout, not just text.	Carnegie Mellon University (Koh et al.)	2024	Active	See leaderboard
WebArena Agents, tool use & computer use	Whether an autonomous agent can complete long-horizon, realistic web tasks (navigation, forms, multi-step workflows) in fully functional self-hosted websites.	Carnegie Mellon University (Zhou, Xu et al.)	2023	Active	See leaderboard
ARC-AGI-1 Reasoning & abstraction	Whether a system can infer the abstract rule of a novel visual grid puzzle from a few examples and apply it to a new input.	Francois Chollet (ARC Prize Foundation)	2019	Active	87.5% (high compute) OpenAI o3-preview
ARC-AGI-2 Reasoning & abstraction	The same fluid-intelligence test as ARC-AGI-1, but with harder, contamination-resistant tasks that stay easy for humans yet very hard for AI.	ARC Prize Foundation (Chollet et al.)	2025	Active	54% (verified) Poetiq (Gemini-based solver)
BIG-Bench Hard Reasoning & abstraction	A suite of multi-step reasoning tasks (logic, arithmetic, algorithmic, commonsense) on which pre-2022 models trailed average human raters.	Suzgun et al. (Google Research and Stanford)	2022	Saturated	See leaderboard
GPQA Diamond Reasoning & abstraction	Graduate and PhD-level multiple-choice scientific reasoning in biology, physics and chemistry, on questions designed to be unanswerable by quick web search.	Rein et al. (NYU, Cohere, Anthropic)	2023	Saturated	~94% Gemini 3.1 Pro Preview
Humanity's Last Exam Reasoning & abstraction	Frontier, closed-ended expert knowledge and reasoning across more than 100 academic disciplines at the limit of human expertise.	Center for AI Safety (CAIS) and Scale AI	2025	Active	53.3% Claude Fable 5 (Max Effort)
MuSR Reasoning & abstraction	Multistep commonsense reasoning embedded in long natural-language narratives such as murder mysteries, object placement and team allocation.	Sprague, Ye, Durrett et al. (UT Austin)	2023	Active	See leaderboard
AIME 2025 Mathematics	Olympiad-track competition mathematics at the level of the American Invitational Mathematics Examination, used as a high-difficulty LLM eval.	Mathematical Association of America; adopted as an LLM eval by the community	2025	Saturated	100% Multiple frontier reasoning models
FrontierMath Mathematics	Research-level original mathematics requiring hours to days of expert effort, across number theory, analysis, algebraic geometry and more.	Epoch AI	2024	Active	52.4% GPT-5.5 Pro
GSM8K Mathematics	Multi-step grade-school arithmetic word-problem reasoning.	OpenAI (Cobbe et al.)	2021	Saturated	~99.6% Frontier models broadly
MATH Mathematics	Step-by-step solving of high-school competition mathematics across algebra, geometry, number theory, probability and precalculus.	Hendrycks et al. (UC Berkeley)	2021	Saturated	~99% (MATH-500) GPT-5
MathArena Mathematics	Mathematical reasoning and proof-writing on freshly released competition problems, evaluated before they can enter training data.	ETH Zurich (SRI Lab)	2025	Active	81.1% (aggregate) GPT-5.5 (xhigh)
Omni-MATH Mathematics	Olympiad-level mathematical reasoning across a broad range of subdomains and difficulty levels.	Gao, Song, Cai et al. (Peking University and collaborators)	2024	Active	See leaderboard
MMLU Knowledge & general QA	Broad academic and professional knowledge across 57 subjects via four-choice multiple-choice questions.	Hendrycks et al. (UC Berkeley and collaborators)	2021	Saturated	~93% Qwen3.7 Max
MMLU-Pro Knowledge & general QA	Harder multi-task reasoning and knowledge designed to de-saturate MMLU and reward deliberate reasoning over recall.	TIGER-Lab (Wang et al., University of Waterloo)	2024	Active	~90% Gemini 3 Pro Preview
MMLU-Redux Knowledge & general QA	A re-annotated, error-corrected subset of MMLU used to measure true knowledge accuracy without the original's label noise.	Gema et al. (University of Edinburgh and collaborators)	2024	Active	See leaderboard
SimpleQA Knowledge & general QA	Short-form parametric factuality: whether a model answers single-answer fact-seeking questions correctly and abstains when unsure.	OpenAI (Wei, Karina et al.)	2024	Active	See leaderboard
LongBench Long context & retrieval	Comprehensive long-context understanding across realistic tasks (QA, summarization, few-shot, code, synthetic) in English and Chinese.	Tsinghua University (THUDM; Bai et al.)	2023	Active	57.7% (v2, with reasoning) o1-preview
MRCR Long context & retrieval	Whether a model can distinguish and retrieve the correct one among multiple near-identical requests buried in a long multi-turn conversation.	Google DeepMind (Michelangelo); open-source variant by OpenAI	2024	Active	See leaderboard
Needle-in-a-Haystack Long context & retrieval	Whether a model can recall a single planted fact (the needle) inserted at varying depths within a long context (the haystack).	Greg Kamradt (independent)	2023	Saturated	See leaderboard
NoLiMa Long context & retrieval	Long-context retrieval and reasoning when the question and the target fact share minimal literal word overlap, forcing latent association rather than keyword matching.	Adobe Research and LMU Munich (Modarressi et al.)	2025	Active	See leaderboard
RULER Long context & retrieval	The real effective context length of a model, measured by testing retrieval, multi-hop tracing, aggregation and QA at increasing sequence lengths.	NVIDIA (Hsieh, Sun et al.)	2024	Active	See leaderboard
MathVista Multimodal & vision	Mathematical and quantitative reasoning grounded in visual contexts such as figures, charts, geometry and scientific diagrams.	Lu et al. (UCLA, University of Washington, Microsoft Research)	2023	Active	~91% (testmini) Seed 2.1 Pro
MMMU Multimodal & vision	College-level multimodal understanding and reasoning over images, diagrams, charts and text across many disciplines.	MMMU team (Yue et al.)	2023	Active	~86% Qwen3.6 Plus
MMMU-Pro Multimodal & vision	A harder, contamination-resistant version of MMMU that forces genuine visual reasoning rather than text-only shortcuts.	MMMU team (Yue et al.)	2024	Active	~84% Gemini 3.5 Flash
Video-MME Multimodal & vision	Comprehensive video understanding by multimodal LLMs across short, medium and long clips.	MME-Benchmarks team (Fu et al.)	2024	Active	~89% Seed 2.1 Pro
Artificial Analysis Intelligence Index Human preference & holistic	A composite index of overall model intelligence aggregating performance across reasoning, coding, knowledge, science and agentic tasks.	Artificial Analysis (independent)	2024	Active	~60 (index) Claude Fable 5
HELM Human preference & holistic	Multi-metric holistic evaluation across many scenarios, reporting accuracy alongside calibration, robustness, fairness, bias, toxicity and efficiency.	Stanford CRFM (Liang, Bommasani et al.)	2022	Active	See leaderboard
LMArena Human preference & holistic	Crowdsourced human preference between two anonymized model responses, aggregated into a relative ranking, not an objective capability.	LMArena (formerly LMSYS; Zheng, Chiang et al.)	2023	Active	~1510 Elo Claude Opus 4.8
MT-Bench Human preference & holistic	Instruction-following and conversational quality on multi-turn prompts, scored automatically by a strong LLM judge.	LMSYS (Zheng et al., UC Berkeley)	2023	Saturated	See leaderboard
HaluEval Safety, hallucination & factuality	A model's ability to recognize hallucinated content across question answering, knowledge-grounded dialogue and summarization.	Li et al. (Renmin University of China)	2023	Active	See leaderboard
TruthfulQA Safety, hallucination & factuality	Whether a model avoids repeating common human misconceptions when answering questions, rather than imitating popular falsehoods.	Lin, Hilton, Evans (Oxford and OpenAI)	2021	Active	See leaderboard
Vectara Hallucination Leaderboard Safety, hallucination & factuality	How often a model introduces unsupported content when summarizing a provided source document, i.e. faithfulness in source-grounded summarization.	Vectara (Hughes et al.)	2023	Active	1.8% (lower is better) antgroup/finix-s1-32b

Top scores are representative snapshots as of June 2026, not live readings: leaderboards move constantly, and many figures come from each benchmark's own board (linked on the name). Where no clean current top score could be confirmed from a primary source, the cell reads "See leaderboard." Confirm at the source before quoting a number.
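
The LMArena row reports an Elo-style rating, which is easier to interpret as a head-to-head preference rate than as a raw number. A minimal sketch, assuming the standard Elo logistic formula with a 400-point scale (LMArena's own fitting procedure may differ) and a hypothetical 1480-rated rival for comparison:

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1510 is the snapshot top rating from the table; 1480 is a hypothetical rival.
p = elo_win_probability(1510, 1480)
print(f"{p:.3f}")  # about 0.543
```

Under these assumptions a 30-point gap corresponds to roughly a 54% preference rate, so small Elo differences near the top of the board are close to a coin flip.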

How to read a benchmark score

A headline score is only as good as the benchmark behind it. Four failure modes decide whether a number means anything, and the strongest benchmarks are the ones that resist all four. The scorecard below rates the most-quoted benchmarks on each concern, where higher always means a less trustworthy score. For the full argument and the receipts, see our guide on whether AI benchmarks are reliable and our breakdown of benchmark contamination.

Benchmark Trust Scorecard. Higher concern means a less trustworthy score.
Benchmark	Contamination	Saturation	Gameability	Real-world gap
MMLU (knowledge MCQ)	High	High	High	High
SWE-bench Verified (real GitHub fixes)	High	High	High	Medium
LMArena (human preference)	Medium	Low	High	High
GPQA Diamond (PhD-level science)	Medium	High	Medium	Medium
ARC-AGI-2 (abstract reasoning)	Low	Low	Medium	High
FrontierMath (research math)	Medium	Medium	Low	Medium
Terminal-Bench (terminal tasks)	Low	Medium	Low	Low

The most-quoted benchmarks rated on four concerns, where higher means a less trustworthy score. The newest, contamination-resistant designs (ARC-AGI-2, Terminal-Bench) carry the least concern; the oldest public sets (MMLU) carry the most. Source: Capital & Compute benchmark trust scorecard, June 29, 2026
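
One way to compare rows of the scorecard is to map each rating to a number and sum across the four concerns. The sketch below does that; the 0/1/2 mapping and the equal weighting are illustrative assumptions, not the scorecard's own methodology.

```python
# Ratings copied from the scorecard, in column order:
# contamination, saturation, gameability, real-world gap.
SCORECARD = {
    "MMLU": ("High", "High", "High", "High"),
    "SWE-bench Verified": ("High", "High", "High", "Medium"),
    "LMArena": ("Medium", "Low", "High", "High"),
    "GPQA Diamond": ("Medium", "High", "Medium", "Medium"),
    "ARC-AGI-2": ("Low", "Low", "Medium", "High"),
    "FrontierMath": ("Medium", "Medium", "Low", "Medium"),
    "Terminal-Bench": ("Low", "Medium", "Low", "Low"),
}

# Assumed numeric mapping; the source publishes only the labels.
CONCERN_POINTS = {"Low": 0, "Medium": 1, "High": 2}


def concern_total(ratings):
    """Sum the assumed points across the four concerns (higher = less trustworthy)."""
    return sum(CONCERN_POINTS[r] for r in ratings)


# Sort from least to most total concern.
ranked = sorted(SCORECARD, key=lambda name: concern_total(SCORECARD[name]))
for name in ranked:
    print(f"{concern_total(SCORECARD[name])}  {name}")
```

With equal weights, Terminal-Bench comes out with the least total concern and MMLU with the most. A reader who cares mainly about one concern, such as contamination, would weight that column more heavily.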

Which benchmark should you actually watch?

Pick by the decision you are making, not by whichever number a vendor leads with.

Choosing a coding model: read a completion benchmark and a quality benchmark together. DeepSWE measures whether an agent finishes the task; FrontierCode measures whether you would merge its code, and the same model can ace one while failing the other. Add Terminal-Bench for agentic, run-it-in-a-real-terminal work.
Judging raw reasoning: watch the unsaturated sets. ARC-AGI-2 (abstract reasoning), Humanity’s Last Exam (expert breadth) and FrontierMath (research math) are still far from solved, so movement there is real progress, not noise.
Comparing general capability: a human-preference ranking like LMArena is the closest to "which feels better to use," but it rewards style as much as substance, so pair it with a composite index and a hard reasoning score.
Ignore the saturated ones: MMLU, GSM8K, MATH and HumanEval are quoted out of habit. When every frontier model scores within a few points of the ceiling, the benchmark is measuring the ceiling, not the model.
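
To make "measuring the ceiling" concrete, the sampling noise on a benchmark score can be approximated from the question count alone. A rough sketch, assuming each question is an independent pass/fail trial and treating the two scores as independent samples (both simplifications), with hypothetical scores on a 500-task set:

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def within_noise(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> bool:
    """True when the gap between two scores is smaller than ~95% sampling noise."""
    se_diff = math.sqrt(
        standard_error(acc_a, n_questions) ** 2
        + standard_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) < z * se_diff


# Hypothetical scores for two models on a 500-task benchmark.
print(within_noise(0.96, 0.95, 500))  # True: a one-point gap is inside the noise
print(within_noise(0.96, 0.80, 500))  # False: a sixteen-point gap is not
```

On a few hundred questions, a one-point lead near the ceiling is statistically indistinguishable from a tie, which is why movement on unsaturated sets carries more information.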

For how these benchmarks rank the current crop of agents, see how the 2026 coding-agent benchmarks actually rank, the coding agents that wrap these models, and the value leaderboard for points-per-dollar.

Why this is a snapshot, not a live feed

Benchmark scores are the most volatile, most gamed layer of model marketing. Leaderboards re-rank weekly, labs report only the tests they win, and the same benchmark can be run under different scaffolds that move the number by ten points or more. This directory is therefore a representative map as of June 29, 2026, built to show what each benchmark means and to ground it in its own source, not to quote a score you can hold anyone to. The durable value is on the left side of the table (what it measures, who built it, whether it is still meaningful); the top-score column is a pointer to the live leaderboard, where you should always confirm before quoting a number.

Frequently asked questions

What is an AI benchmark?: An AI benchmark is a standardized test made of a fixed dataset, a task specification, and a scoring metric, used to measure and compare how well AI models perform a specific skill such as reasoning, coding, math, or knowledge. Running many models through the same test produces a single comparable score, which is what model leaderboards rank.
What are the main AI benchmarks in 2026?: For coding, SWE-bench Verified, DeepSWE, and Terminal-Bench. For reasoning, GPQA Diamond, ARC-AGI-2, and Humanity’s Last Exam. For math, FrontierMath and AIME. For broad knowledge, MMLU-Pro. For multimodal, MMMU. And for overall human preference, the LMArena Elo ranking. Many older benchmarks like MMLU, GSM8K, and HumanEval are now saturated and quoted mainly out of habit.
Are AI benchmarks reliable?: Partly. A benchmark is reliable only as far as its score reflects real capability, and four things erode that: contamination (test data leaking into training), saturation (top models bunched near the ceiling), gameability (a score inflated without real skill), and vendor cherry-picking (a lab reporting only the benchmarks it wins). The most trustworthy benchmarks are contamination-resistant, unsaturated, and run by an independent party. For the full breakdown, see our guide on whether AI benchmarks are reliable.
What does it mean when a benchmark is saturated?: A benchmark is saturated when the strongest models all score near its ceiling, so the differences between them are within noise and the benchmark no longer separates a better model from a worse one. MMLU, GSM8K, MATH, and HumanEval are all saturated in 2026, with top models scoring in the 90s, which is why the field keeps building harder replacements like MMLU-Pro and ARC-AGI-2.
What is the difference between SWE-bench and SWE-bench Verified?: SWE-bench is the original 2,294-task set of real GitHub issues. SWE-bench Verified is a 500-task subset that OpenAI and the SWE-bench authors hand-checked in 2024 to remove broken tests and unsolvable issues, so it is the cleaner, more-quoted version. By 2026 even Verified is treated as saturated and contamination-prone, and OpenAI now recommends the harder SWE-bench Pro instead.
Which AI benchmark matters most for coding?: There is no single one, because they measure different things. SWE-bench Verified and DeepSWE measure whether an agent can resolve a real issue; Terminal-Bench measures whether it can operate a real terminal end to end; FrontierCode measures whether the code is clean enough to merge; LiveCodeBench measures contamination-free competitive programming. For shipping production code, read a completion benchmark and a quality benchmark together rather than trusting one number.
What is benchmark contamination?: Contamination is when a benchmark’s questions or answers leak into a model’s training data, so the model can recall the answer instead of reasoning it out. It inflates scores without reflecting real capability and is the main reason public, static benchmarks decay over time. The defenses are private or held-out test sets, time-stamped problems released after a model’s cutoff, and freshly generated tasks.
What is GPQA Diamond, and why is it hard?: GPQA Diamond is a 198-question set of graduate and PhD-level biology, physics, and chemistry questions written by domain experts to be Google-proof, meaning a skilled non-expert with unrestricted web access still cannot reliably answer them. It tests reasoning over retrieval. By 2026 top models exceed the roughly 70% human-expert baseline and sit in the low-to-mid 90s, so it is now largely saturated.
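
The time-stamp defense mentioned in the contamination answer reduces to a date filter: score a model only on problems published after its training cutoff. A minimal sketch; the problem records, field names, and cutoff date are all hypothetical.

```python
from datetime import date

# Hypothetical problem records; a real benchmark would load these from its dataset.
PROBLEMS = [
    {"id": "p1", "released": date(2025, 3, 1)},
    {"id": "p2", "released": date(2025, 11, 15)},
    {"id": "p3", "released": date(2026, 2, 10)},
]


def uncontaminated(problems, training_cutoff: date):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical cutoff: anything released on or before this date may be in training data.
fresh = uncontaminated(PROBLEMS, training_cutoff=date(2025, 6, 30))
print([p["id"] for p in fresh])  # ['p2', 'p3']
```

The filter is only as trustworthy as the stated cutoff, which is one reason private or held-out test sets are used alongside it.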

Sources

Each benchmark is grounded in its primary source (the original paper, the project repository, or the official leaderboard), verified June 29, 2026. Top scores are representative snapshots from those sources:

Aider Polyglot (2024). Aider (Paul Gauthier). aider.chat/2024/12/21/polyglot.html
BigCodeBench (2024). BigCode project (Zhuo et al.). arxiv.org/abs/2406.15877
DeepSWE (2026). Datacurve. github.com/datacurve-ai/deep-swe
FrontierCode (2026). Cognition (with 20+ open-source maintainers). cognition.com/blog/frontier-code
HumanEval (2021). OpenAI (Chen et al.). arxiv.org/abs/2107.03374
LiveCodeBench (2024). UC Berkeley, MIT and Cornell (Jain, Han et al.). arxiv.org/abs/2403.07974
MBPP (2021). Google Research (Austin, Odena et al.). arxiv.org/abs/2108.07732
Multi-SWE-bench (2025). ByteDance (ByteDance Seed). arxiv.org/abs/2504.02605
RepoBench (2023). Liu, Xu and McAuley (UC San Diego). arxiv.org/abs/2306.03091
SWE-bench (2023). Princeton and Stanford (Jimenez, Yang, Yao et al.). arxiv.org/abs/2310.06770
SWE-bench Multimodal (2024). Stanford and Princeton (Yang, Jimenez et al.). arxiv.org/abs/2410.03859
SWE-bench Pro (2025). Scale AI (Scale Labs). arxiv.org/abs/2509.16941
SWE-bench Verified (2024). OpenAI (with the SWE-bench authors). openai.com/index/introducing-swe-bench-verified/
SWE-Lancer (2025). OpenAI (Miserendino, Patwardhan et al.). arxiv.org/abs/2502.12115
Terminal-Bench (2026). Stanford and the Laude Institute. arxiv.org/abs/2601.11868
AgentBench (2023). Tsinghua University (THUDM; Liu et al.). arxiv.org/abs/2308.03688
BrowseComp (2025). OpenAI (Wei, Sun et al.). arxiv.org/abs/2504.12516
GAIA (2023). Meta AI and Hugging Face (Mialon, Fourrier et al.). arxiv.org/abs/2311.12983
MLE-bench (2024). OpenAI (Chan et al.). arxiv.org/abs/2410.07095
OSWorld (2024). XLANG Lab, University of Hong Kong (Xie et al.). arxiv.org/abs/2404.07972
tau-bench (2024). Sierra (Yao, Shinn, Narasimhan et al.). arxiv.org/abs/2406.12045
VisualWebArena (2024). Carnegie Mellon University (Koh et al.). arxiv.org/abs/2401.13649
WebArena (2023). Carnegie Mellon University (Zhou, Xu et al.). arxiv.org/abs/2307.13854
ARC-AGI-1 (2019). Francois Chollet (ARC Prize Foundation). arcprize.org/arc-agi/1
ARC-AGI-2 (2025). ARC Prize Foundation (Chollet et al.). arcprize.org/arc-agi/2
BIG-Bench Hard (2022). Suzgun et al. (Google Research and Stanford). github.com/suzgunmirac/BIG-Bench-Hard
GPQA Diamond (2023). Rein et al. (NYU, Cohere, Anthropic). arxiv.org/abs/2311.12022
Humanity's Last Exam (2025). Center for AI Safety (CAIS) and Scale AI. arxiv.org/abs/2501.14249
MuSR (2023). Sprague, Ye, Durrett et al. (UT Austin). arxiv.org/abs/2310.16049
AIME 2025 (2025). Mathematical Association of America; adopted as an LLM eval by the community. matharena.ai/
FrontierMath (2024). Epoch AI. epoch.ai/frontiermath
GSM8K (2021). OpenAI (Cobbe et al.). arxiv.org/abs/2110.14168
MATH (2021). Hendrycks et al. (UC Berkeley). arxiv.org/abs/2103.03874
MathArena (2025). ETH Zurich (SRI Lab). arxiv.org/abs/2505.23281
Omni-MATH (2024). Gao, Song, Cai et al. (Peking University and collaborators). arxiv.org/abs/2410.07985
MMLU (2021). Hendrycks et al. (UC Berkeley and collaborators). arxiv.org/abs/2009.03300
MMLU-Pro (2024). TIGER-Lab (Wang et al., University of Waterloo). arxiv.org/abs/2406.01574
MMLU-Redux (2024). Gema et al. (University of Edinburgh and collaborators). arxiv.org/abs/2406.04127
SimpleQA (2024). OpenAI (Wei, Karina et al.). arxiv.org/abs/2411.04368
LongBench (2023; v2 in 2024). Tsinghua University (THUDM; Bai et al.). arxiv.org/abs/2412.15204 (v2 paper)
MRCR (2024). Google DeepMind (Michelangelo); open-source variant by OpenAI. arxiv.org/abs/2409.12640
Needle-in-a-Haystack (2023). Greg Kamradt (independent). github.com/gkamradt/LLMTest_NeedleInAHaystack
NoLiMa (2025). Adobe Research and LMU Munich (Modarressi et al.). arxiv.org/abs/2502.05167
RULER (2024). NVIDIA (Hsieh, Sun et al.). arxiv.org/abs/2404.06654
MathVista (2023). Lu et al. (UCLA, University of Washington, Microsoft Research). arxiv.org/abs/2310.02255
MMMU (2023). MMMU team (Yue et al.). arxiv.org/abs/2311.16502
MMMU-Pro (2024). MMMU team (Yue et al.). arxiv.org/abs/2409.02813
Video-MME (2024). MME-Benchmarks team (Fu et al.). arxiv.org/abs/2405.21075
Artificial Analysis Intelligence Index (2024). Artificial Analysis (independent). artificialanalysis.ai/methodology/intelligence-benchmarking
HELM (2022). Stanford CRFM (Liang, Bommasani et al.). arxiv.org/abs/2211.09110
LMArena (2023). LMArena (formerly LMSYS; Zheng, Chiang et al.). arxiv.org/abs/2403.04132
MT-Bench (2023). LMSYS (Zheng et al., UC Berkeley). arxiv.org/abs/2306.05685
HaluEval (2023). Li et al. (Renmin University of China). arxiv.org/abs/2305.11747
TruthfulQA (2021). Lin, Hilton, Evans (Oxford and OpenAI). arxiv.org/abs/2109.07958
Vectara Hallucination Leaderboard (2023). Vectara (Hughes et al.). github.com/vectara/hallucination-leaderboard

Machine-readable data: /ai-benchmarks.json. Benchmark reliability ratings are from our benchmark trust scorecard.
