Capital & Compute
Directory · Updated June 29, 2026

AI benchmarks

Every model launch quotes a wall of benchmark names. This directory maps 55 of them across 9 categories, from the coding and agent tests builders watch most to reasoning, math, knowledge, long context, multimodal, human preference and safety. For each one it lists what the benchmark measures, who built it, the year, how it is scored, a representative current top score, and a link to its leaderboard. For models ranked by value, see the value leaderboard; for why these scores are easier to trust some years than others, see our guide on whether AI benchmarks are reliable.

What are the main AI benchmarks?

The most-watched AI benchmarks in 2026, by what they test, are:

  • SWE-bench Verified and DeepSWE: real-world coding agents
  • Terminal-Bench: agentic command-line tasks
  • GPQA Diamond: PhD-level science reasoning
  • Humanity’s Last Exam: hardest cross-domain reasoning
  • ARC-AGI-2: novel abstract reasoning
  • FrontierMath and AIME: competition and research math
  • MMLU-Pro: broad academic knowledge
  • LMArena: human-preference ranking (Elo)
  • MMMU: college-level multimodal understanding

Older sets like MMLU, GSM8K and HumanEval are now saturated, with top models at roughly 93% or above, so they are quoted mainly out of habit.

  • 55 benchmarks mapped: across 9 capability categories
  • 15 coding & agent tests: the deepest category, and the one builders watch most
  • 12 now saturated: top models near the ceiling, so the score no longer separates them
  • 2026 newest entries: DeepSWE, FrontierCode, Terminal-Bench

What is an AI benchmark?

An AI benchmark is a standardized test, made of a fixed dataset, a task specification, and a scoring metric, used to measure and compare how well AI models perform a specific skill such as reasoning, coding, math, or knowledge. Running many models through the same test produces a single comparable number, which is what a model leaderboard ranks. The catch is that a benchmark only stays meaningful while it is hard: once the frontier clears it, or its answers leak into training data, the score stops telling good models from great ones, and the field has to build a harder one.
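The three parts named above (fixed dataset, task specification, scoring metric) can be sketched in a few lines. This is a minimal illustration, not any real benchmark's harness: the `Example` type, the toy questions, and the two stand-in "models" (plain functions) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    prompt: str      # task input shown to the model
    reference: str   # gold answer the output is scored against


def exact_match(prediction: str, reference: str) -> bool:
    """Scoring metric: exact match, ignoring case and surrounding whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def run_benchmark(
    model: Callable[[str], str],
    dataset: List[Example],
    metric: Callable[[str, str], bool] = exact_match,
) -> float:
    """Run one model over the fixed dataset and return its accuracy."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(metric(model(ex.prompt), ex.reference) for ex in dataset)
    return correct / len(dataset)


# Hypothetical fixed dataset: every model sees exactly the same items.
DATASET = [
    Example("2 + 2 = ?", "4"),
    Example("Capital of France?", "Paris"),
    Example("3 * 3 = ?", "9"),
]

# Hypothetical stand-in models: lookup tables instead of real systems.
ANSWERS_A = {"2 + 2 = ?": "4", "Capital of France?": "paris", "3 * 3 = ?": "6"}
ANSWERS_B = {"2 + 2 = ?": "4", "Capital of France?": "Paris", "3 * 3 = ?": "9"}


def model_a(prompt: str) -> str:
    return ANSWERS_A.get(prompt, "")


def model_b(prompt: str) -> str:
    return ANSWERS_B.get(prompt, "")


# A leaderboard is the same test applied to many models, sorted by score.
leaderboard = sorted(
    ((name, run_benchmark(fn, DATASET)) for name, fn in
     {"model_a": model_a, "model_b": model_b}.items()),
    key=lambda row: row[1],
    reverse=True,
)
```

Because the dataset and metric are held fixed, the only thing that varies between rows of `leaderboard` is the model, which is what makes the single number comparable.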

Benchmarks by category

The 55 benchmarks split into 9 categories. Coding and agents is the largest, both because it is the most commercially watched skill and because contamination forced builders to keep replacing the older sets. To search or filter all of them at once, use the directory tool below.

AI benchmarks tracked, by category

Category | Benchmarks
Coding | 15
Agents | 8
Reasoning | 6
Mathematics | 6
Knowledge | 4
Long context | 5
Multimodal | 4
Human preference | 4
Safety | 3

How the tracked benchmarks split across the nine categories. Coding and software-engineering agents is the deepest, reflecting both demand and the churn of contamination-driven replacements.
Source: Capital & Compute AI benchmark directory, June 29, 2026

Coding & software-engineering agents

Whether a model can write, edit, and fix real code, increasingly as a multi-step agent working in a real repository. The most-watched category for AI builders, and where contamination bites hardest.

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
Aider Polyglot | Percent correct after the second attempt, plus percent using the correct edit format | How well a model writes and correctly edits code across many languages, including applying diffs in the right format and self-correcting after test failures. | Aider | 2024 | Active | See leaderboard
BigCodeBench | pass@1 against rigorous per-task test suites | Whether models can write code that correctly invokes multiple function calls from diverse real libraries to satisfy complex, practical instructions. | BigCode project | 2024 | Active | See leaderboard
DeepSWE | pass@1 (committed code graded in a clean environment) | Whether frontier coding agents can complete original, long-horizon engineering tasks written from scratch, with no upstream PR to memorize. | Datacurve | 2026 | Active | ~70%, Claude Fable 5 (max)
FrontierCode | Pass rate on blocker criteria plus a weighted six-dimension quality rubric | Whether a coding agent produces a mergeable, production-quality pull request, not just one that passes tests, judged on correctness, regression safety, scope, tests and style. | Cognition | 2026 | Active | 13.4% (Diamond), Claude Opus 4.8
HumanEval | pass@k (primarily pass@1) | Whether a model can synthesize a single correct Python function from a docstring so that it passes the provided unit tests. | OpenAI | 2021 | Saturated | ~99%, frontier models broadly
LiveCodeBench | pass@1 | Code generation and related skills (self-repair, execution, test-output prediction) on fresh competitive-programming problems, designed to be contamination-free. | UC Berkeley, MIT and Cornell | 2024 | Active | See leaderboard
MBPP | pass@1 | Whether a model can generate short, entry-level Python functions from a natural-language prompt that pass the provided tests. | Google Research | 2021 | Saturated | ~95%+, frontier models broadly
Multi-SWE-bench | % resolved (pass@1) | Cross-language issue resolution: whether agents can resolve real GitHub issues with a passing patch across many languages beyond Python. | ByteDance | 2025 | Active | See leaderboard
RepoBench | Retrieval accuracy and exact-match / edit similarity for next-line completion | Repository-level code auto-completion: retrieving relevant cross-file context, predicting the next line, and the combined retrieval-plus-completion pipeline. | Liu, Xu and McAuley | 2023 | Active | See leaderboard
SWE-bench | % resolved (pass@1) | Whether a system can resolve a real GitHub issue by generating a patch that passes the repository's hidden tests. | Princeton and Stanford | 2023 | Saturated | See leaderboard
SWE-bench Multimodal | % resolved (pass@1) | Whether coding agents can resolve real GitHub issues in visual, user-facing JavaScript software where the bug or feature involves the UI. | Stanford and Princeton | 2024 | Active | See leaderboard
SWE-bench Pro | % resolved (pass@1) under standardized agent scaffolding | Whether agents can solve long-horizon, enterprise-grade software-engineering tasks under standardized scaffolding, designed to resist contamination. | Scale AI | 2025 | Active | 59.1% (public set), GPT-5.4 (xHigh)
SWE-bench Verified | % resolved (pass@1) | The same real-GitHub-issue resolution task as SWE-bench, restricted to a human-validated subset where the issue is solvable and the tests are not broken. | OpenAI | 2024 | Saturated | ~95%, Claude Fable 5
SWE-Lancer | Dollars earned (and % of tasks resolved) | Whether frontier models can complete real paid freelance software jobs, both coding and technical-management tasks, well enough to earn the payouts. | OpenAI | 2025 | Active | See leaderboard
Terminal-Bench | Pass/fail, graded by verification scripts in the agent's Docker environment (pass@1) | Whether an AI agent can complete hard, realistic command-line tasks (build, configure, train, debug, secure) end to end inside a real terminal. | Stanford and the Laude Institute | 2026 | Active | ~82%, Codex (GPT-5.5)
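Several rows above are scored with pass@k. The directory does not say how each board computes it; a common choice is the unbiased estimator introduced with HumanEval, which draws n samples per problem, counts the c that pass the tests, and estimates the chance that at least one of k drawn samples passes. The sketch below assumes that estimator; the function name and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: 10 samples per problem, with 3, 0 and 10 passing.
per_problem = [(10, 3), (10, 0), (10, 10)]
score_at_1 = benchmark_pass_at_k(per_problem, k=1)
score_at_5 = benchmark_pass_at_k(per_problem, k=5)
```

With k=1 the estimator reduces to the plain fraction of passing samples (c / n), which is why pass@1 and "% resolved" are often quoted interchangeably.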

Agents, tool use & computer use

Whether a model can plan, call tools, browse, and operate a computer or website to finish open-ended tasks, not just answer a question in one shot.

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
AgentBench | Per-environment success aggregated into an overall score | How well an LLM acts as an autonomous agent in multi-turn, open-ended decision-making across diverse interactive environments. | Tsinghua University | 2023 | Active | See leaderboard
BrowseComp | Accuracy via model-graded semantic equivalence to the reference answer | Whether a browsing agent can persistently navigate the open web to locate a single hard-to-find, entangled fact. | OpenAI | 2025 | Active | 51.5%, OpenAI Deep Research (launch paper)
GAIA | Exact-match accuracy against an unambiguous answer | Whether an AI assistant can answer real-world questions that require multi-step reasoning, multiple modalities, web browsing and general tool use. | Meta AI and Hugging Face | 2023 | Active | ~75%, HAL agent (Claude Sonnet 4.5)
MLE-bench | Medal rate (fraction of competitions reaching bronze/silver/gold thresholds) | Whether an AI agent can do end-to-end machine-learning engineering (data prep, training, experimentation, submission) at the level of human Kaggle competitors. | OpenAI | 2024 | Active | 16.9% (paper baseline), o1-preview with AIDE scaffolding
OSWorld | Execution-based success rate via per-task verification scripts that inspect machine state | Whether a multimodal agent can operate a real computer (desktop apps, file I/O, multi-app workflows) to complete open-ended tasks in a live virtual machine. | XLANG Lab, University of Hong Kong | 2024 | Active | See leaderboard
tau-bench | pass^k: the probability an agent succeeds across all k independent trials (reliability, not just average success) | Whether a tool-using agent can reliably complete customer-service tasks over multi-turn conversations with a simulated user while obeying domain policies. | Sierra | 2024 | Active | See leaderboard
VisualWebArena | Functional success rate via execution-based evaluation | Whether a multimodal agent can complete visually grounded web tasks that require interpreting images and page layout, not just text. | Carnegie Mellon University | 2024 | Active | See leaderboard
WebArena | Functional success rate via execution-based reward checking the end state | Whether an autonomous agent can complete long-horizon, realistic web tasks (navigation, forms, multi-step workflows) in fully functional self-hosted websites. | Carnegie Mellon University | 2023 | Active | See leaderboard
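The pass^k metric listed for tau-bench rewards consistency rather than a single lucky run. One way to estimate it, assumed here rather than taken from the directory, is to run each task n times, count the c successes, and compute the chance that k trials drawn from those n all succeed. The function names and trial counts below are illustrative.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task: probability that k trials ALL succeed.

    n: trials run, c: trials that succeeded, k: trials required to succeed.
    Computes C(c, k) / C(n, k), which is 0 whenever c < k.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return comb(c, k) / comb(n, k)


def mean_pass_hat_k(results: list, k: int) -> float:
    """Average pass^k over (n, c) pairs, one pair per task."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical agent: 8 trials on each of three tasks.
trials = [(8, 8), (8, 6), (8, 2)]
reliability = {k: mean_pass_hat_k(trials, k) for k in (1, 2, 4, 8)}
```

For k=1 the value equals the average success rate; as k grows it can only fall, so an agent that succeeds on 75% of single attempts may score far lower at k=8. That drop is the reliability gap the metric is meant to expose.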

Reasoning & abstraction

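Hard multi-step reasoning and fluid, novel problem-solving designed to resist memorization. This category holds the benchmarks the frontier is still furthest from solving, such as ARC-AGI-2 and Humanity's Last Exam, alongside older sets that are now saturated.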

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
ARC-AGI-1 | pass@2 exact-grid-match accuracy | Whether a system can infer the abstract rule of a novel visual grid puzzle from a few examples and apply it to a new input. | Francois Chollet | 2019 | Active | 87.5% (high compute), OpenAI o3-preview
ARC-AGI-2 | pass@2 exact-grid-match accuracy, reported with a cost-per-task efficiency metric | The same fluid-intelligence test as ARC-AGI-1, but with harder, contamination-resistant tasks that stay easy for humans yet very hard for AI. | ARC Prize Foundation | 2025 | Active | 54% (verified), Poetiq (Gemini-based solver)
BIG-Bench Hard | Per-task accuracy averaged across the 23 tasks | A suite of multi-step reasoning tasks (logic, arithmetic, algorithmic, commonsense) on which pre-2022 models trailed average human raters. | Suzgun et al. | 2022 | Saturated | See leaderboard
GPQA Diamond | Multiple-choice accuracy (random baseline 25%, PhD-expert baseline about 70%) | Graduate and PhD-level multiple-choice scientific reasoning in biology, physics and chemistry, on questions designed to be unanswerable by quick web search. | Rein et al. | 2023 | Saturated | ~94%, Gemini 3.1 Pro Preview
Humanity's Last Exam | Accuracy (exact match / multiple-choice), often reported with a calibration metric | Frontier, closed-ended expert knowledge and reasoning across more than 100 academic disciplines at the limit of human expertise. | Center for AI Safety (CAIS) and Scale AI | 2025 | Active | 53.3%, Claude Fable 5 (Max Effort)
MuSR | Multiple-choice accuracy | Multistep commonsense reasoning embedded in long natural-language narratives such as murder mysteries, object placement and team allocation. | Sprague, Ye, Durrett et al. | 2023 | Active | See leaderboard
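The "pass@2 exact-grid-match" scoring listed for the ARC-AGI rows can be read as: a task counts as solved only if one of at most two submitted grids matches the target cell for cell. The sketch below assumes that reading; the grids and helper names are made up for illustration.

```python
from typing import List, Tuple

Grid = List[List[int]]  # rows of integer color codes


def task_solved(attempts: List[Grid], target: Grid, max_attempts: int = 2) -> bool:
    """True if any of the first `max_attempts` grids equals the target exactly.

    There is no partial credit: one wrong cell or a wrong shape fails.
    """
    return any(attempt == target for attempt in attempts[:max_attempts])


def pass_at_2_accuracy(tasks: List[Tuple[List[Grid], Grid]]) -> float:
    """Fraction of tasks solved within two attempts."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    return sum(task_solved(attempts, target) for attempts, target in tasks) / len(tasks)


# Hypothetical submissions: (attempts, target) per task.
target_a = [[1, 0], [0, 1]]
target_b = [[2, 2], [2, 2]]
tasks = [
    ([[[1, 0], [0, 0]], [[1, 0], [0, 1]]], target_a),  # right on 2nd try
    ([[[2, 2], [2, 0]], [[2, 2], [0, 2]]], target_b),  # one cell off twice
    ([[[0]], [[1]], target_a], target_a),              # right only on 3rd try
]
accuracy = pass_at_2_accuracy(tasks)
```

The third task shows why the attempt cap matters: a correct grid submitted after the second attempt does not count.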

Mathematics

From grade-school word problems to research-level proofs. The older sets are saturated; the newest are held back from the public to stay contamination-resistant.

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
AIME 2025 | Exact-match accuracy, usually pass@1 averaged over samples | Olympiad-track competition mathematics at the level of the American Invitational Mathematics Examination, used as a high-difficulty LLM eval. | Mathematical Association of America; adopted as an LLM eval by the community | 2025 | Saturated | 100%, multiple frontier reasoning models
FrontierMath | Accuracy (fraction with a correct, automatically verifiable final answer) | Research-level original mathematics requiring hours to days of expert effort, across number theory, analysis, algebraic geometry and more. | Epoch AI | 2024 | Active | 52.4%, GPT-5.5 Pro
GSM8K | Exact-match accuracy on the final numeric answer | Multi-step grade-school arithmetic word-problem reasoning. | OpenAI | 2021 | Saturated | ~99.6%, frontier models broadly
MATH | Exact-match accuracy on the final boxed answer | Step-by-step solving of high-school competition mathematics across algebra, geometry, number theory, probability and precalculus. | Hendrycks et al. | 2021 | Saturated | ~99% (MATH-500), GPT-5
MathArena | Per-competition accuracy and an aggregate expected-performance score | Mathematical reasoning and proof-writing on freshly released competition problems, evaluated before they can enter training data. | ETH Zurich | 2025 | Active | 81.1% (aggregate), GPT-5.5 (xhigh)
Omni-MATH | Accuracy, scored with an LLM-based verifier (Omni-Judge) | Olympiad-level mathematical reasoning across a broad range of subdomains and difficulty levels. | Gao, Song, Cai et al. | 2024 | Active | See leaderboard

Knowledge & general QA

Broad academic and factual knowledge across domains, usually multiple-choice. The most-quoted and most-saturated family, now largely replaced by harder variants.

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
MMLU | Accuracy | Broad academic and professional knowledge across 57 subjects via four-choice multiple-choice questions. | Hendrycks et al. | 2021 | Saturated | ~93%, Qwen3.7 Max
MMLU-Pro | Accuracy | Harder multi-task reasoning and knowledge designed to de-saturate MMLU and reward deliberate reasoning over recall. | TIGER-Lab | 2024 | Active | ~90%, Gemini 3 Pro Preview
MMLU-Redux | Accuracy on cleaned labels | A re-annotated, error-corrected subset of MMLU used to measure true knowledge accuracy without the original's label noise. | Gema et al. | 2024 | Active | See leaderboard
SimpleQA | Accuracy, plus correct-given-attempted and an F-score balancing attempts against accuracy | Short-form parametric factuality: whether a model answers single-answer fact-seeking questions correctly and abstains when unsure. | OpenAI | 2024 | Active | See leaderboard
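The three SimpleQA numbers in the table relate to one another in a way that is easy to show with arithmetic. The sketch below assumes each answer is graded as correct, incorrect, or not attempted, and that the F-score is the harmonic mean of overall accuracy and correct-given-attempted; the counts are invented for illustration.

```python
def simpleqa_style_metrics(correct: int, incorrect: int, not_attempted: int) -> dict:
    """Compute overall accuracy, correct-given-attempted, and their harmonic mean.

    Abstaining ("not attempted") lowers overall accuracy but does not
    count against correct-given-attempted, so the F-score balances the two.
    """
    total = correct + incorrect + not_attempted
    if total == 0:
        raise ValueError("at least one graded answer is required")
    attempted = correct + incorrect
    overall = correct / total
    given_attempted = correct / attempted if attempted else 0.0
    denom = overall + given_attempted
    f_score = 2 * overall * given_attempted / denom if denom else 0.0
    return {
        "overall_correct": overall,
        "correct_given_attempted": given_attempted,
        "f_score": f_score,
    }


# Hypothetical model A guesses on everything; model B abstains when unsure.
guesser = simpleqa_style_metrics(correct=45, incorrect=55, not_attempted=0)
abstainer = simpleqa_style_metrics(correct=40, incorrect=10, not_attempted=50)
```

In this made-up comparison the guesser has the higher overall accuracy (0.45 against 0.40), but the abstainer is right on 80% of what it attempts and ends up with the higher F-score, which is the behavior the benchmark is described as rewarding.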

Long context & retrieval

Whether a model can actually use a very long input, not just accept it: finding facts, resolving references, and reasoning across hundreds of thousands of tokens.

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
LongBench | v1: per-task automatic metrics. v2: multiple-choice accuracy | Comprehensive long-context understanding across realistic tasks (QA, summarization, few-shot, code, synthetic) in English and Chinese. | Tsinghua University | 2023 | Active | 57.7% (v2, with reasoning), o1-preview
MRCR | Similarity of the model output to the target instance, gated by a required answer-prefix | Whether a model can distinguish and retrieve the correct one among multiple near-identical requests buried in a long multi-turn conversation. | Google DeepMind (Michelangelo); open-source variant by OpenAI | 2024 | Active | See leaderboard
Needle-in-a-Haystack | Retrieval accuracy at each depth and length cell | Whether a model can recall a single planted fact (the needle) inserted at varying depths within a long context (the haystack). | Greg Kamradt | 2023 | Saturated | See leaderboard
NoLiMa | Accuracy at each length, relative to the model's short-context baseline | Long-context retrieval and reasoning when the question and the target fact share minimal literal word overlap, forcing latent association rather than keyword matching. | Adobe Research and LMU Munich | 2025 | Active | See leaderboard
RULER | Weighted-average accuracy across tasks and lengths; effective length is the longest length still above threshold | The real effective context length of a model by testing retrieval, multi-hop tracing, aggregation and QA at increasing sequence lengths. | NVIDIA | 2024 | Active | See leaderboard
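The "effective length" idea in the RULER row can be sketched as a simple scan: test the model at increasing context lengths and report the longest one whose accuracy is still at or above a threshold. The threshold value, the accuracy figures, and the choice to stop at the first failing length are assumptions made for this illustration, not RULER's published settings.

```python
from typing import Dict, Optional


def effective_length(accuracy_by_length: Dict[int, float],
                     threshold: float) -> Optional[int]:
    """Longest context length still at or above `threshold`.

    Lengths are scanned in ascending order and the scan stops at the
    first failure, so a recovery at a longer length is not credited.
    Returns None if the model fails even at the shortest length.
    """
    best = None
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        best = length
    return best


# Hypothetical model advertised with a 131,072-token window.
scores = {
    4_096: 0.96,
    8_192: 0.93,
    16_384: 0.88,
    32_768: 0.81,
    65_536: 0.70,
    131_072: 0.55,
}
usable = effective_length(scores, threshold=0.85)
```

Here the model accepts 131,072 tokens but only stays above the chosen bar up to 16,384, which is the gap between accepting a long input and actually using it.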

Multimodal & vision

Reasoning over images, charts, documents, and video alongside text. The frontier for models that see, not just read.

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
MathVista | Accuracy | Mathematical and quantitative reasoning grounded in visual contexts such as figures, charts, geometry and scientific diagrams. | Lu et al. | 2023 | Active | ~91% (testmini), Seed 2.1 Pro
MMMU | Accuracy | College-level multimodal understanding and reasoning over images, diagrams, charts and text across many disciplines. | MMMU team | 2023 | Active | ~86%, Qwen3.6 Plus
MMMU-Pro | Accuracy | A harder, contamination-resistant version of MMMU that forces genuine visual reasoning rather than text-only shortcuts. | MMMU team | 2024 | Active | ~84%, Gemini 3.5 Flash
Video-MME | Accuracy (tested with and without subtitles) | Comprehensive video understanding by multimodal LLMs across short, medium and long clips. | MME-Benchmarks team | 2024 | Active | ~89%, Seed 2.1 Pro

Human preference & holistic

Aggregate and head-to-head measures: human-voted arenas, composite indices, and multi-metric frameworks that rank overall capability rather than one skill.

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
Artificial Analysis Intelligence Index | Composite index score (0 to 100 aggregate) | A composite index of overall model intelligence aggregating performance across reasoning, coding, knowledge, science and agentic tasks. | Artificial Analysis | 2024 | Active | ~60 (index), Claude Fable 5
HELM | Multi-metric (per-metric scores across scenarios; no single headline number) | Multi-metric holistic evaluation across many scenarios, reporting accuracy alongside calibration, robustness, fairness, bias, toxicity and efficiency. | Stanford CRFM | 2022 | Active | See leaderboard
LMArena | Elo / Bradley-Terry pairwise rating (an Arena Score) | Crowdsourced human preference between two anonymized model responses, aggregated into a relative ranking, not an objective capability. | LMArena | 2023 | Active | ~1510 Elo, Claude Opus 4.8
MT-Bench | LLM-as-judge score (1 to 10 scale, averaged) | Instruction-following and conversational quality on multi-turn prompts, scored automatically by a strong LLM judge. | LMSYS | 2023 | Saturated | See leaderboard
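An Elo-style rating such as the LMArena figure above is easiest to read as a win probability against another rating. The sketch below uses the standard Elo logistic curve with a 400-point scale and a hypothetical K-factor of 32. It illustrates the idea only; a Bradley-Terry fit over all votes, which the table also names, is computed differently, and the ratings used here are examples.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, a_won: bool,
                   k: float = 32.0) -> Tuple[float, float]:
    """Apply one pairwise vote; the points A gains are the points B loses."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    return rating_a + delta, rating_b - delta


# A 100-point gap implies roughly a 64% preference rate, not a 100% one.
p_win = expected_score(1510, 1410)

# One vote between two equally rated models moves each by k / 2.
new_a, new_b = update_ratings(1500, 1500, a_won=True)
```

This is why the table describes the result as a relative ranking: the number only has meaning as a gap to other models rated on the same pool of votes.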

Safety, hallucination & factuality

Whether a model tells the truth and resists making things up. Measures honesty and hallucination rate, not raw capability.

Benchmark | How it is scored | What it measures | Maker | Year | Status | Top score (model)
HaluEval | Hallucination-recognition accuracy (faithful vs hallucinated) | A model's ability to recognize hallucinated content across question answering, knowledge-grounded dialogue and summarization. | Li et al. | 2023 | Active | See leaderboard
TruthfulQA | % truthful (and % truthful-and-informative) | Whether a model avoids repeating common human misconceptions when answering questions, rather than imitating popular falsehoods. | Lin, Hilton, Evans | 2021 | Active | See leaderboard
Vectara Hallucination Leaderboard | Hallucination rate (% of summaries judged unfaithful; lower is better) | How often a model introduces unsupported content when summarizing a provided source document, i.e. faithfulness in closed-book summarization. | Vectara | 2023 | Active | 1.8% (lower is better), antgroup/finix-s1-32b

Search and filter all 55 benchmarks

Filter by category or status, or search by name, alias, what a benchmark measures, or who built it.

Benchmark | Category | What it measures | Maker | Year | Status | Top score (model)
Aider Polyglot | Coding & software-engineering agents | How well a model writes and correctly edits code across many languages, including applying diffs in the right format and self-correcting after test failures. | Aider (Paul Gauthier) | 2024 | Active | See leaderboard
BigCodeBench | Coding & software-engineering agents | Whether models can write code that correctly invokes multiple function calls from diverse real libraries to satisfy complex, practical instructions. | BigCode project (Zhuo et al.) | 2024 | Active | See leaderboard
DeepSWE | Coding & software-engineering agents | Whether frontier coding agents can complete original, long-horizon engineering tasks written from scratch, with no upstream PR to memorize. | Datacurve | 2026 | Active | ~70%, Claude Fable 5 (max)
FrontierCode | Coding & software-engineering agents | Whether a coding agent produces a mergeable, production-quality pull request, not just one that passes tests, judged on correctness, regression safety, scope, tests and style. | Cognition (with 20+ open-source maintainers) | 2026 | Active | 13.4% (Diamond), Claude Opus 4.8
HumanEval | Coding & software-engineering agents | Whether a model can synthesize a single correct Python function from a docstring so that it passes the provided unit tests. | OpenAI (Chen et al.) | 2021 | Saturated | ~99%, frontier models broadly
LiveCodeBench | Coding & software-engineering agents | Code generation and related skills (self-repair, execution, test-output prediction) on fresh competitive-programming problems, designed to be contamination-free. | UC Berkeley, MIT and Cornell (Jain, Han et al.) | 2024 | Active | See leaderboard
MBPP | Coding & software-engineering agents | Whether a model can generate short, entry-level Python functions from a natural-language prompt that pass the provided tests. | Google Research (Austin, Odena et al.) | 2021 | Saturated | ~95%+, frontier models broadly
Multi-SWE-bench | Coding & software-engineering agents | Cross-language issue resolution: whether agents can resolve real GitHub issues with a passing patch across many languages beyond Python. | ByteDance (ByteDance Seed) | 2025 | Active | See leaderboard
RepoBench | Coding & software-engineering agents | Repository-level code auto-completion: retrieving relevant cross-file context, predicting the next line, and the combined retrieval-plus-completion pipeline. | Liu, Xu and McAuley (UC San Diego) | 2023 | Active | See leaderboard
SWE-bench | Coding & software-engineering agents | Whether a system can resolve a real GitHub issue by generating a patch that passes the repository's hidden tests. | Princeton and Stanford (Jimenez, Yang, Yao et al.) | 2023 | Saturated | See leaderboard
SWE-bench Multimodal | Coding & software-engineering agents | Whether coding agents can resolve real GitHub issues in visual, user-facing JavaScript software where the bug or feature involves the UI. | Stanford and Princeton (Yang, Jimenez et al.) | 2024 | Active | See leaderboard
SWE-bench Pro | Coding & software-engineering agents | Whether agents can solve long-horizon, enterprise-grade software-engineering tasks under standardized scaffolding, designed to resist contamination. | Scale AI (Scale Labs) | 2025 | Active | 59.1% (public set), GPT-5.4 (xHigh)
SWE-bench Verified | Coding & software-engineering agents | The same real-GitHub-issue resolution task as SWE-bench, restricted to a human-validated subset where the issue is solvable and the tests are not broken. | OpenAI (with the SWE-bench authors) | 2024 | Saturated | ~95%, Claude Fable 5
SWE-Lancer | Coding & software-engineering agents | Whether frontier models can complete real paid freelance software jobs, both coding and technical-management tasks, well enough to earn the payouts. | OpenAI (Miserendino, Patwardhan et al.) | 2025 | Active | See leaderboard
Terminal-Bench | Coding & software-engineering agents | Whether an AI agent can complete hard, realistic command-line tasks (build, configure, train, debug, secure) end to end inside a real terminal. | Stanford and the Laude Institute | 2026 | Active | ~82%, Codex (GPT-5.5)
AgentBench | Agents, tool use & computer use | How well an LLM acts as an autonomous agent in multi-turn, open-ended decision-making across diverse interactive environments. | Tsinghua University (THUDM; Liu et al.) | 2023 | Active | See leaderboard
BrowseComp | Agents, tool use & computer use | Whether a browsing agent can persistently navigate the open web to locate a single hard-to-find, entangled fact. | OpenAI (Wei, Sun et al.) | 2025 | Active | 51.5%, OpenAI Deep Research (launch paper)
GAIA | Agents, tool use & computer use | Whether an AI assistant can answer real-world questions that require multi-step reasoning, multiple modalities, web browsing and general tool use. | Meta AI and Hugging Face (Mialon, Fourrier et al.) | 2023 | Active | ~75%, HAL agent (Claude Sonnet 4.5)
MLE-bench | Agents, tool use & computer use | Whether an AI agent can do end-to-end machine-learning engineering (data prep, training, experimentation, submission) at the level of human Kaggle competitors. | OpenAI (Chan et al.) | 2024 | Active | 16.9% (paper baseline), o1-preview with AIDE scaffolding
OSWorld | Agents, tool use & computer use | Whether a multimodal agent can operate a real computer (desktop apps, file I/O, multi-app workflows) to complete open-ended tasks in a live virtual machine. | XLANG Lab, University of Hong Kong (Xie et al.) | 2024 | Active | See leaderboard
tau-bench | Agents, tool use & computer use | Whether a tool-using agent can reliably complete customer-service tasks over multi-turn conversations with a simulated user while obeying domain policies. | Sierra (Yao, Shinn, Narasimhan et al.) | 2024 | Active | See leaderboard
VisualWebArena | Agents, tool use & computer use | Whether a multimodal agent can complete visually grounded web tasks that require interpreting images and page layout, not just text. | Carnegie Mellon University (Koh et al.) | 2024 | Active | See leaderboard
WebArena | Agents, tool use & computer use | Whether an autonomous agent can complete long-horizon, realistic web tasks (navigation, forms, multi-step workflows) in fully functional self-hosted websites. | Carnegie Mellon University (Zhou, Xu et al.) | 2023 | Active | See leaderboard
ARC-AGI-1 | Reasoning & abstraction | Whether a system can infer the abstract rule of a novel visual grid puzzle from a few examples and apply it to a new input. | Francois Chollet (ARC Prize Foundation) | 2019 | Active | 87.5% (high compute), OpenAI o3-preview
ARC-AGI-2 | Reasoning & abstraction | The same fluid-intelligence test as ARC-AGI-1, but with harder, contamination-resistant tasks that stay easy for humans yet very hard for AI. | ARC Prize Foundation (Chollet et al.) | 2025 | Active | 54% (verified), Poetiq (Gemini-based solver)
BIG-Bench Hard | Reasoning & abstraction | A suite of multi-step reasoning tasks (logic, arithmetic, algorithmic, commonsense) on which pre-2022 models trailed average human raters. | Suzgun et al. (Google Research and Stanford) | 2022 | Saturated | See leaderboard
GPQA Diamond | Reasoning & abstraction | Graduate and PhD-level multiple-choice scientific reasoning in biology, physics and chemistry, on questions designed to be unanswerable by quick web search. | Rein et al. (NYU, Cohere, Anthropic) | 2023 | Saturated | ~94%, Gemini 3.1 Pro Preview
Humanity's Last Exam | Reasoning & abstraction | Frontier, closed-ended expert knowledge and reasoning across more than 100 academic disciplines at the limit of human expertise. | Center for AI Safety (CAIS) and Scale AI | 2025 | Active | 53.3%, Claude Fable 5 (Max Effort)
MuSR | Reasoning & abstraction | Multistep commonsense reasoning embedded in long natural-language narratives such as murder mysteries, object placement and team allocation. | Sprague, Ye, Durrett et al. (UT Austin) | 2023 | Active | See leaderboard
AIME 2025 | Mathematics | Olympiad-track competition mathematics at the level of the American Invitational Mathematics Examination, used as a high-difficulty LLM eval. | Mathematical Association of America; adopted as an LLM eval by the community | 2025 | Saturated | 100%, multiple frontier reasoning models
FrontierMath | Mathematics | Research-level original mathematics requiring hours to days of expert effort, across number theory, analysis, algebraic geometry and more. | Epoch AI | 2024 | Active | 52.4%, GPT-5.5 Pro
GSM8K | Mathematics | Multi-step grade-school arithmetic word-problem reasoning. | OpenAI (Cobbe et al.) | 2021 | Saturated | ~99.6%, frontier models broadly
MATH | Mathematics | Step-by-step solving of high-school competition mathematics across algebra, geometry, number theory, probability and precalculus. | Hendrycks et al. (UC Berkeley) | 2021 | Saturated | ~99% (MATH-500), GPT-5
MathArena | Mathematics | Mathematical reasoning and proof-writing on freshly released competition problems, evaluated before they can enter training data. | ETH Zurich (SRI Lab) | 2025 | Active | 81.1% (aggregate), GPT-5.5 (xhigh)
Omni-MATH | Mathematics | Olympiad-level mathematical reasoning across a broad range of subdomains and difficulty levels. | Gao, Song, Cai et al. (Peking University and collaborators) | 2024 | Active | See leaderboard
MMLU | Knowledge & general QA | Broad academic and professional knowledge across 57 subjects via four-choice multiple-choice questions. | Hendrycks et al. (UC Berkeley and collaborators) | 2021 | Saturated | ~93%, Qwen3.7 Max
MMLU-Pro | Knowledge & general QA | Harder multi-task reasoning and knowledge designed to de-saturate MMLU and reward deliberate reasoning over recall. | TIGER-Lab (Wang et al., University of Waterloo) | 2024 | Active | ~90%, Gemini 3 Pro Preview
MMLU-Redux | Knowledge & general QA | A re-annotated, error-corrected subset of MMLU used to measure true knowledge accuracy without the original's label noise. | Gema et al. (University of Edinburgh and collaborators) | 2024 | Active | See leaderboard
SimpleQA | Knowledge & general QA | Short-form parametric factuality: whether a model answers single-answer fact-seeking questions correctly and abstains when unsure. | OpenAI (Wei, Karina et al.) | 2024 | Active | See leaderboard
LongBench | Long context & retrieval | Comprehensive long-context understanding across realistic tasks (QA, summarization, few-shot, code, synthetic) in English and Chinese. | Tsinghua University (THUDM; Bai et al.) | 2023 | Active | 57.7% (v2, with reasoning), o1-preview
MRCR | Long context & retrieval | Whether a model can distinguish and retrieve the correct one among multiple near-identical requests buried in a long multi-turn conversation. | Google DeepMind (Michelangelo); open-source variant by OpenAI | 2024 | Active | See leaderboard
Needle-in-a-Haystack | Long context & retrieval | Whether a model can recall a single planted fact (the needle) inserted at varying depths within a long context (the haystack). | Greg Kamradt (independent) | 2023 | Saturated | See leaderboard
NoLiMa | Long context & retrieval | Long-context retrieval and reasoning when the question and the target fact share minimal literal word overlap, forcing latent association rather than keyword matching. | Adobe Research and LMU Munich (Modarressi et al.) | 2025 | Active | See leaderboard
RULER | Long context & retrieval | The real effective context length of a model by testing retrieval, multi-hop tracing, aggregation and QA at increasing sequence lengths. | NVIDIA (Hsieh, Sun et al.) | 2024 | Active | See leaderboard
MathVista | Multimodal & vision | Mathematical and quantitative reasoning grounded in visual contexts such as figures, charts, geometry and scientific diagrams. | Lu et al. (UCLA, University of Washington, Microsoft Research) | 2023 | Active | ~91% (testmini), Seed 2.1 Pro
MMMU | Multimodal & vision | College-level multimodal understanding and reasoning over images, diagrams, charts and text across many disciplines. | MMMU team (Yue et al.) | 2023 | Active | ~86%, Qwen3.6 Plus
MMMU-Pro | Multimodal & vision | A harder, contamination-resistant version of MMMU that forces genuine visual reasoning rather than text-only shortcuts. | MMMU team (Yue et al.) | 2024 | Active | ~84%, Gemini 3.5 Flash
Video-MME | Multimodal & vision | Comprehensive video understanding by multimodal LLMs across short, medium and long clips. | MME-Benchmarks team (Fu et al.) | 2024 | Active | ~89%, Seed 2.1 Pro
Artificial Analysis Intelligence Index | Human preference & holistic | A composite index of overall model intelligence aggregating performance across reasoning, coding, knowledge, science and agentic tasks. | Artificial Analysis (independent) | 2024 | Active | ~60 (index), Claude Fable 5
HELM | Human preference & holistic | Multi-metric holistic evaluation across many scenarios, reporting accuracy alongside calibration, robustness, fairness, bias, toxicity and efficiency. | Stanford CRFM (Liang, Bommasani et al.) | 2022 | Active | See leaderboard
LMArena | Human preference & holistic | Crowdsourced human preference between two anonymized model responses, aggregated into a relative ranking, not an objective capability. | LMArena (formerly LMSYS; Zheng, Chiang et al.) | 2023 | Active | ~1510 Elo, Claude Opus 4.8
MT-Bench | Human preference & holistic | Instruction-following and conversational quality on multi-turn prompts, scored automatically by a strong LLM judge. | LMSYS (Zheng et al., UC Berkeley) | 2023 | Saturated | See leaderboard
HaluEval | Safety, hallucination & factuality | A model's ability to recognize hallucinated content across question answering, knowledge-grounded dialogue and summarization. | Li et al. (Renmin University of China) | 2023 | Active | See leaderboard
TruthfulQA | Safety, hallucination & factuality | Whether a model avoids repeating common human misconceptions when answering questions, rather than imitating popular falsehoods. | Lin, Hilton, Evans (Oxford and OpenAI) | 2021 | Active | See leaderboard
Vectara Hallucination Leaderboard | Safety, hallucination & factuality | How often a model introduces unsupported content when summarizing a provided source document, i.e. faithfulness in closed-book summarization. | Vectara (Hughes et al.) | 2023 | Active | 1.8% (lower is better), antgroup/finix-s1-32b

Top scores are representative snapshots as of June 2026, not live readings: leaderboards move constantly, and many figures come from each benchmark's own board (linked on the name). Where no clean current top score could be confirmed from a primary source, the cell reads "See leaderboard." Confirm at the source before quoting a number.

How to read a benchmark score

A headline score is only as good as the benchmark behind it. Four failure modes (contamination, saturation, gameability, and real-world gap) decide whether a number means anything, and the strongest benchmarks are the ones that resist all four. The scorecard below rates the most-quoted benchmarks on each concern, where a higher rating always means a less trustworthy score. For the full argument and the receipts, read our guide on whether AI benchmarks are reliable and our breakdown of benchmark contamination.

[Figure: Benchmark Trust Scorecard. Each major AI benchmark is rated Low, Medium, or High on four concern axes (contamination, saturation, gameability, and real-world gap), where higher always means a less trustworthy score. Full ratings are in the data table below.]
Benchmark Trust Scorecard. Higher concern means a less trustworthy score.
Benchmark | Contamination | Saturation | Gameability | Real-world gap
MMLU (knowledge MCQ) | High | High | High | High
SWE-bench Verified (real GitHub fixes) | High | High | High | Medium
LMArena (human preference) | Medium | Low | High | High
GPQA Diamond (PhD-level science) | Medium | High | Medium | Medium
ARC-AGI v2 (abstract reasoning) | Low | Low | Medium | High
FrontierMath (research math) | Medium | Medium | Low | Medium
Terminal-Bench (terminal tasks) | Low | Medium | Low | Low
The most-quoted benchmarks rated on four concerns, where higher means a less trustworthy score. The newest, contamination-resistant designs (ARC-AGI v2, Terminal-Bench) sit greenest; the oldest public sets (MMLU) sit reddest.
Source: Capital & Compute benchmark trust scorecard, June 29, 2026
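
One way to compare rows of the scorecard at a glance is to collapse the four ratings into a single total. The sketch below copies the ratings from the table above; the numeric mapping (Low = 0, Medium = 1, High = 2), the equal weighting of the four axes, and the function names are our own assumptions for illustration, not part of the scorecard's method.

```python
# Ratings copied from the scorecard table; the 0/1/2 mapping is an assumption.
CONCERN = {"Low": 0, "Medium": 1, "High": 2}

# Column order: contamination, saturation, gameability, real-world gap.
SCORECARD = {
    "MMLU": ("High", "High", "High", "High"),
    "SWE-bench Verified": ("High", "High", "High", "Medium"),
    "LMArena": ("Medium", "Low", "High", "High"),
    "GPQA Diamond": ("Medium", "High", "Medium", "Medium"),
    "ARC-AGI v2": ("Low", "Low", "Medium", "High"),
    "FrontierMath": ("Medium", "Medium", "Low", "Medium"),
    "Terminal-Bench": ("Low", "Medium", "Low", "Low"),
}


def total_concern(ratings):
    """Sum the four ratings, weighting every axis equally (an assumption)."""
    return sum(CONCERN[r] for r in ratings)


def rank_by_trust(scorecard):
    """Return (name, total) pairs, least concerning benchmark first."""
    totals = [(name, total_concern(r)) for name, r in scorecard.items()]
    return sorted(totals, key=lambda pair: pair[1])


ranking = rank_by_trust(SCORECARD)
for name, total in ranking:
    print(f"{name:20s} {total}")
```

Under this equal-weight assumption Terminal-Bench comes out with the lowest total and MMLU with the highest, which matches the caption's reading of the chart. A reader who cares mostly about one axis, say contamination, would want to weight that column more heavily.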

Which benchmark should you actually watch?

Pick by the decision you are making, not by whichever number a vendor leads with.

  • Choosing a coding model: read a completion benchmark and a quality benchmark together. DeepSWE measures whether an agent finishes the task; FrontierCode measures whether you would merge its code, and the same model can ace one while failing the other. Add Terminal-Bench for agentic, run-it-in-a-real-terminal work.
  • Judging raw reasoning: watch the unsaturated sets. ARC-AGI-2 (abstract reasoning), Humanity’s Last Exam (expert breadth) and FrontierMath (research math) are still far from solved, so movement there is real progress, not noise.
  • Comparing general capability: a human-preference ranking like LMArena is the closest to "which feels better to use," but it rewards style as much as substance, so pair it with a composite index and a hard reasoning score.
  • Ignore the saturated ones: MMLU, GSM8K, MATH and HumanEval are quoted out of habit. When every frontier model scores above 95%, the benchmark is measuring the ceiling, not the model.
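
To put a human-preference rating gap like LMArena's in perspective, the conventional Elo formula converts a rating difference into an expected head-to-head win rate. This is a minimal sketch assuming the standard 400-point logistic Elo scale; how LMArena actually fits its ratings is not described here, and the ratings used are hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings, chosen only to show how small gaps translate.
BASELINE = 1500
for gap in (0, 10, 50, 100):
    p = elo_expected_score(BASELINE + gap, BASELINE)
    print(f"{gap:>3}-point lead -> expected win rate {p:.3f}")
```

On this scale a 10-point lead corresponds to winning only about 51% of pairwise votes, and a 100-point lead to about 64%, which is why small gaps at the top of a preference board say little on their own.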

For how these benchmarks rank the current crop of agents, see how the 2026 coding-agent benchmarks actually rank, the coding agents that wrap these models, and the value leaderboard for points-per-dollar.

Why this is a snapshot, not a live feed

Benchmark scores are the most volatile, most gamed layer of model marketing. Leaderboards re-rank weekly, labs report only the tests they win, and the same benchmark can be run under different scaffolds that move the number by ten points or more. This directory is therefore a representative map as of June 29, 2026, built to show what each benchmark means and to ground it in its own source, not to quote a score you can hold anyone to. The durable value is in the left-hand columns of the table (what it measures, who built it, whether it is still meaningful); the top-score column is a pointer to the live leaderboard, where you should always confirm before quoting a number.

Frequently asked questions

What is an AI benchmark?
An AI benchmark is a standardized test made of a fixed dataset, a task specification, and a scoring metric, used to measure and compare how well AI models perform a specific skill such as reasoning, coding, math, or knowledge. Running many models through the same test produces a single comparable score, which is what model leaderboards rank.
What are the main AI benchmarks in 2026?
For coding, SWE-bench Verified, DeepSWE, and Terminal-Bench. For reasoning, GPQA Diamond, ARC-AGI-2, and Humanity’s Last Exam. For math, FrontierMath and AIME. For broad knowledge, MMLU-Pro. For multimodal, MMMU. And for overall human preference, the LMArena Elo ranking. Many older benchmarks like MMLU, GSM8K, and HumanEval are now saturated and quoted mainly out of habit.
Are AI benchmarks reliable?
Partly. A benchmark is reliable only as far as its score reflects real capability, and four things erode that: contamination (test data leaking into training), saturation (top models bunched near the ceiling), gameability (a score inflated without real skill), and vendor cherry-picking (a lab reporting only the benchmarks it wins). The most trustworthy benchmarks are contamination-resistant, unsaturated, and run by an independent party. For the full breakdown, see our guide on whether AI benchmarks are reliable.
What does it mean when a benchmark is saturated?
A benchmark is saturated when the strongest models all score near its ceiling, so the differences between them are within noise and the benchmark no longer separates a better model from a worse one. MMLU, GSM8K, MATH, and HumanEval are all saturated in 2026, with top models above 95%, which is why the field keeps building harder replacements like MMLU-Pro and ARC-AGI-2.
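
The "within noise" part of that definition can be made concrete with a rough sampling-error check. The sketch below is illustrative only: the 95% threshold comes from the text above, while the binomial error model, the treatment of the two runs as independent, the 1.96 multiplier, and the example scores and question count are assumptions.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def is_saturated(top_scores, ceiling_threshold=0.95):
    """True when every listed top score is at or above the threshold."""
    return all(score >= ceiling_threshold for score in top_scores)


def within_noise(score_a, score_b, n_questions, z=1.96):
    """True when the gap is smaller than z standard errors of the difference.

    Treats the two runs as independent samples, which is a simplification:
    models scored on the same questions are really paired.
    """
    se_a = accuracy_standard_error(score_a, n_questions)
    se_b = accuracy_standard_error(score_b, n_questions)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(score_a - score_b) < z * se_diff


# Hypothetical top scores on a hypothetical 500-question benchmark.
top = [0.96, 0.965, 0.97]
print(is_saturated(top))               # every score clears 95%
print(within_noise(0.96, 0.97, 500))   # one-point gap near the ceiling
print(within_noise(0.60, 0.70, 500))   # ten-point gap mid-range
```

With these assumed numbers, a one-point gap between two models in the high 90s is indistinguishable from sampling error, while a ten-point gap in the middle of the range is not.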
What is the difference between SWE-bench and SWE-bench Verified?
SWE-bench is the original 2,294-task set of real GitHub issues. SWE-bench Verified is a 500-task subset that OpenAI and the SWE-bench authors hand-checked in 2024 to remove broken tests and unsolvable issues, so it is the cleaner, more-quoted version. By 2026 even Verified is treated as saturated and contamination-prone, and OpenAI now recommends the harder SWE-bench Pro instead.
Which AI benchmark matters most for coding?
There is no single one, because they measure different things. SWE-bench Verified and DeepSWE measure whether an agent can resolve a real issue; Terminal-Bench measures whether it can operate a real terminal end to end; FrontierCode measures whether the code is clean enough to merge; LiveCodeBench measures contamination-free competitive programming. For shipping production code, read a completion benchmark and a quality benchmark together rather than trusting one number.
What is benchmark contamination?
Contamination is when a benchmark’s questions or answers leak into a model’s training data, so the model can recall the answer instead of reasoning it out. It inflates scores without reflecting real capability and is the main reason public, static benchmarks decay over time. The defenses are private or held-out test sets, time-stamped problems released after a model’s cutoff, and freshly generated tasks.
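
The time-stamp defense can be sketched as a simple filter: score a model only on problems published after its training cutoff. The `Problem` type, the field names, and all dates below are hypothetical, and real benchmarks that use this idea may define release dates and cutoffs differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """Hypothetical benchmark item with a public release date."""
    problem_id: str
    released: date


def uncontaminated_subset(problems, training_cutoff: date):
    """Keep only problems released strictly after the training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problems and cutoff, for illustration only.
problems = [
    Problem("p1", date(2025, 3, 1)),
    Problem("p2", date(2025, 11, 15)),
    Problem("p3", date(2026, 2, 10)),
]
fresh = uncontaminated_subset(problems, training_cutoff=date(2025, 9, 30))
print([p.problem_id for p in fresh])
```

The filter only helps if the stated cutoff is accurate and the problems were genuinely unpublished before their release date, so it reduces contamination risk rather than ruling it out.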
What is GPQA Diamond, and why is it hard?
GPQA Diamond is a 198-question set of graduate and PhD-level biology, physics, and chemistry questions written by domain experts to be Google-proof, meaning a skilled non-expert with unrestricted web access still cannot answer them reliably, even given ample time. It tests reasoning over retrieval. By 2026 top models exceed the roughly 70% human-expert baseline and sit in the low-to-mid 90s, so it is now largely saturated.

Sources

Each benchmark is grounded in its primary source: the original paper, the project repository, or the official leaderboard, verified June 29, 2026. Top scores are representative snapshots from those sources.

Machine-readable data: /ai-benchmarks.json. Benchmark reliability ratings are from our benchmark trust scorecard.
