AI benchmarks
Every model launch quotes a wall of benchmark names. This directory maps 55 of them across 9 categories, from the coding and agent tests builders watch most to reasoning, math, knowledge, long context, multimodal, human preference and safety. For each benchmark it lists what it measures, who built it, the year, how it is scored, a representative current top score, and a link to its leaderboard. For models ranked by value, see the value leaderboard; for why these scores are easier to trust in some years than in others, see our guide on whether AI benchmarks are reliable.
What are the main AI benchmarks?
The most-watched AI benchmarks in 2026, by what they test, are:
- SWE-bench Verified and DeepSWE: real-world coding agents
- Terminal-Bench: agentic command-line tasks
- GPQA Diamond: PhD-level science reasoning
- Humanity’s Last Exam: hardest cross-domain reasoning
- ARC-AGI-2: novel abstract reasoning
- FrontierMath and AIME: competition and research math
- MMLU-Pro: broad academic knowledge
- LMArena: human-preference ranking (Elo)
- MMMU: college-level multimodal understanding
Older sets like MMLU, GSM8K and HumanEval are now saturated, with top models at roughly 93% to 99.6%, so they are quoted mainly out of habit.
What is an AI benchmark?
An AI benchmark is a standardized test, made of a fixed dataset, a task specification, and a scoring metric, used to measure and compare how well AI models perform a specific skill such as reasoning, coding, math, or knowledge. Running many models through the same test produces a single comparable number, which is what a model leaderboard ranks. The catch is that a benchmark only stays meaningful while it is hard: once the frontier clears it, or its answers leak into training data, the score stops telling good models from great ones, and the field has to build a harder one.
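As a rough illustration of the three parts named above, the sketch below wires a fixed dataset, a task specification (here just a prompt string) and a scoring metric into one comparable number. Everything in it is hypothetical: `Example`, `exact_match`, `run_benchmark` and the toy model are illustrative names, not any real benchmark's harness.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    prompt: str     # task input shown to the model
    reference: str  # gold answer the output is scored against


def exact_match(prediction: str, reference: str) -> float:
    """Scoring metric: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def run_benchmark(
    model: Callable[[str], str],
    dataset: Sequence[Example],
    metric: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every example through the model and average the metric."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    return sum(metric(model(ex.prompt), ex.reference) for ex in dataset) / len(dataset)


# Hypothetical two-item dataset and a toy "model" that only knows one answer.
DATASET = [Example("2 + 2 = ?", "4"), Example("Capital of France?", "Paris")]


def toy_model(prompt: str) -> str:
    return {"2 + 2 = ?": "4"}.get(prompt, "unknown")


score = run_benchmark(toy_model, DATASET)
print(f"accuracy = {score:.2f}")
```

Because the dataset and metric are fixed, any two models passed to `run_benchmark` yield directly comparable scores, which is the property a leaderboard relies on.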
Benchmarks by category
The 55 benchmarks split into 9 categories. Coding and software-engineering agents is the largest with 15 entries, both because it is the most commercially watched skill and because contamination forced builders to keep replacing the older sets. To search or filter all of them at once, use the directory tool below.
| Category | Benchmarks |
|---|---|
| Coding | 15 |
| Agents | 8 |
| Reasoning | 6 |
| Mathematics | 6 |
| Knowledge | 4 |
| Long context | 5 |
| Multimodal | 4 |
| Human preference | 4 |
| Safety | 3 |
Coding & software-engineering agents
Whether a model can write, edit, and fix real code, increasingly as a multi-step agent working in a real repository. The most-watched category for AI builders, and where contamination bites hardest.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| Aider Polyglot | Percent correct after the second attempt, plus percent using the correct edit format | How well a model writes and correctly edits code across many languages, including applying diffs in the right format and self-correcting after test failures. | Aider · 2024 | Active | See leaderboard |
| BigCodeBench | pass@1 against rigorous per-task test suites | Whether models can write code that correctly invokes multiple function calls from diverse real libraries to satisfy complex, practical instructions. | BigCode project · 2024 | Active | See leaderboard |
| DeepSWE | pass@1 (committed code graded in a clean environment) | Whether frontier coding agents can complete original, long-horizon engineering tasks written from scratch, with no upstream PR to memorize. | Datacurve · 2026 | Active | ~70% · Claude Fable 5 (max) |
| FrontierCode | Pass rate on blocker criteria plus a weighted six-dimension quality rubric | Whether a coding agent produces a mergeable, production-quality pull request, not just one that passes tests, judged on correctness, regression safety, scope, tests and style. | Cognition · 2026 | Active | 13.4% (Diamond) · Claude Opus 4.8 |
| HumanEval | pass@k (primarily pass@1) | Whether a model can synthesize a single correct Python function from a docstring so that it passes the provided unit tests. | OpenAI · 2021 | Saturated | ~99% · Frontier models broadly |
| LiveCodeBench | pass@1 | Code generation and related skills (self-repair, execution, test-output prediction) on fresh competitive-programming problems, designed to be contamination-free. | UC Berkeley, MIT and Cornell · 2024 | Active | See leaderboard |
| MBPP | pass@1 | Whether a model can generate short, entry-level Python functions from a natural-language prompt that pass the provided tests. | Google Research · 2021 | Saturated | ~95%+ · Frontier models broadly |
| Multi-SWE-bench | % resolved (pass@1) | Cross-language issue resolution: whether agents can resolve real GitHub issues with a passing patch across many languages beyond Python. | ByteDance · 2025 | Active | See leaderboard |
| RepoBench | Retrieval accuracy and exact-match / edit similarity for next-line completion | Repository-level code auto-completion: retrieving relevant cross-file context, predicting the next line, and the combined retrieval-plus-completion pipeline. | Liu, Xu and McAuley · 2023 | Active | See leaderboard |
| SWE-bench | % resolved (pass@1) | Whether a system can resolve a real GitHub issue by generating a patch that passes the repository's hidden tests. | Princeton and Stanford · 2023 | Saturated | See leaderboard |
| SWE-bench Multimodal | % resolved (pass@1) | Whether coding agents can resolve real GitHub issues in visual, user-facing JavaScript software where the bug or feature involves the UI. | Stanford and Princeton · 2024 | Active | See leaderboard |
| SWE-bench Pro | % resolved (pass@1) under standardized agent scaffolding | Whether agents can solve long-horizon, enterprise-grade software-engineering tasks under standardized scaffolding, designed to resist contamination. | Scale AI · 2025 | Active | 59.1% (public set) · GPT-5.4 (xHigh) |
| SWE-bench Verified | % resolved (pass@1) | The same real-GitHub-issue resolution task as SWE-bench, restricted to a human-validated subset where the issue is solvable and the tests are not broken. | OpenAI · 2024 | Saturated | ~95% · Claude Fable 5 |
| SWE-Lancer | Dollars earned (and % of tasks resolved) | Whether frontier models can complete real paid freelance software jobs, both coding and technical-management tasks, well enough to earn the payouts. | OpenAI · 2025 | Active | See leaderboard |
| Terminal-Bench | Pass/fail, graded by verification scripts in the agent's Docker environment (pass@1) | Whether an AI agent can complete hard, realistic command-line tasks (build, configure, train, debug, secure) end to end inside a real terminal. | Stanford and the Laude Institute · 2026 | Active | ~82% · Codex (GPT-5.5) |
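
Most rows in this table report pass@1 or pass@k. The following is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper (Chen et al., 2021), assuming `n` samples are drawn per task and `c` of them pass the tests. The function name and the per-task counts are illustrative, and whether a given leaderboard uses this estimator or a single greedy attempt should be confirmed at the source.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n recorded samples (c of them correct), passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-task results: (samples drawn, samples that passed).
results = [(10, 3), (10, 0), (10, 10)]

# The benchmark score is the mean of the per-task estimates.
pass_at_1 = mean(pass_at_k(n, c, 1) for n, c in results)
pass_at_5 = mean(pass_at_k(n, c, 5) for n, c in results)
print(f"pass@1 = {pass_at_1:.3f}, pass@5 = {pass_at_5:.3f}")
```

With `k = 1` the estimator reduces to the plain fraction of passing samples, which is why pass@1 is often read as "solved on the first try".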
Agents, tool use & computer use
Whether a model can plan, call tools, browse, and operate a computer or website to finish open-ended tasks, not just answer a question in one shot.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| AgentBench | Per-environment success aggregated into an overall score | How well an LLM acts as an autonomous agent in multi-turn, open-ended decision-making across diverse interactive environments. | Tsinghua University · 2023 | Active | See leaderboard |
| BrowseComp | Accuracy via model-graded semantic equivalence to the reference answer | Whether a browsing agent can persistently navigate the open web to locate a single hard-to-find, entangled fact. | OpenAI · 2025 | Active | 51.5% · OpenAI Deep Research (launch paper) |
| GAIA | Exact-match accuracy against an unambiguous answer | Whether an AI assistant can answer real-world questions that require multi-step reasoning, multiple modalities, web browsing and general tool use. | Meta AI and Hugging Face · 2023 | Active | ~75% · HAL agent (Claude Sonnet 4.5) |
| MLE-bench | Medal rate (fraction of competitions reaching bronze/silver/gold thresholds) | Whether an AI agent can do end-to-end machine-learning engineering (data prep, training, experimentation, submission) at the level of human Kaggle competitors. | OpenAI · 2024 | Active | 16.9% (paper baseline) · o1-preview with AIDE scaffolding |
| OSWorld | Execution-based success rate via per-task verification scripts that inspect machine state | Whether a multimodal agent can operate a real computer (desktop apps, file I/O, multi-app workflows) to complete open-ended tasks in a live virtual machine. | XLANG Lab, University of Hong Kong · 2024 | Active | See leaderboard |
| tau-bench | pass^k: the probability an agent succeeds across all k independent trials (reliability, not just average success) | Whether a tool-using agent can reliably complete customer-service tasks over multi-turn conversations with a simulated user while obeying domain policies. | Sierra · 2024 | Active | See leaderboard |
| VisualWebArena | Functional success rate via execution-based evaluation | Whether a multimodal agent can complete visually grounded web tasks that require interpreting images and page layout, not just text. | Carnegie Mellon University · 2024 | Active | See leaderboard |
| WebArena | Functional success rate via execution-based reward checking the end state | Whether an autonomous agent can complete long-horizon, realistic web tasks (navigation, forms, multi-step workflows) in fully functional self-hosted websites. | Carnegie Mellon University · 2023 | Active | See leaderboard |
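
The tau-bench row defines pass^k as the probability that an agent succeeds on all k independent trials. The sketch below shows one way to estimate it, assuming each task was run `n` times with `c` successes and using a combinatorial estimator (the chance that k trials drawn from the n recorded ones are all successes). The function name and the trial counts are hypothetical, and the exact estimator a given leaderboard uses should be checked at the source.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that k trials, drawn without replacement
    from n recorded trials with c successes, all succeed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when k > c


# Hypothetical task: 8 recorded trials, 6 of them successful.
n_trials, n_success = 8, 6
reliability = {k: pass_hat_k(n_trials, n_success, k) for k in (1, 2, 4)}

for k, value in reliability.items():
    print(f"pass^{k} = {value:.3f}")
```

An agent with a 75% single-trial success rate looks much weaker once it has to succeed several times in a row, which is the reliability gap this metric is meant to expose.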
Reasoning & abstraction
Hard multi-step reasoning and fluid, novel problem-solving designed to resist memorization. The newer entries are benchmarks the frontier is still far from solving, while GPQA Diamond and BIG-Bench Hard are already saturated.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| ARC-AGI-1 | pass@2 exact-grid-match accuracy | Whether a system can infer the abstract rule of a novel visual grid puzzle from a few examples and apply it to a new input. | Francois Chollet · 2019 | Active | 87.5% (high compute) · OpenAI o3-preview |
| ARC-AGI-2 | pass@2 exact-grid-match accuracy, reported with a cost-per-task efficiency metric | The same fluid-intelligence test as ARC-AGI-1, but with harder, contamination-resistant tasks that stay easy for humans yet very hard for AI. | ARC Prize Foundation · 2025 | Active | 54% (verified) · Poetiq (Gemini-based solver) |
| BIG-Bench Hard | Per-task accuracy averaged across the 23 tasks | A suite of multi-step reasoning tasks (logic, arithmetic, algorithmic, commonsense) on which pre-2022 models trailed average human raters. | Suzgun et al. · 2022 | Saturated | See leaderboard |
| GPQA Diamond | Multiple-choice accuracy (random baseline 25%, PhD-expert baseline about 70%) | Graduate and PhD-level multiple-choice scientific reasoning in biology, physics and chemistry, on questions designed to be unanswerable by quick web search. | Rein et al. · 2023 | Saturated | ~94% · Gemini 3.1 Pro Preview |
| Humanity's Last Exam | Accuracy (exact match / multiple-choice), often reported with a calibration metric | Frontier, closed-ended expert knowledge and reasoning across more than 100 academic disciplines at the limit of human expertise. | Center for AI Safety (CAIS) and Scale AI · 2025 | Active | 53.3% · Claude Fable 5 (Max Effort) |
| MuSR | Multiple-choice accuracy | Multistep commonsense reasoning embedded in long natural-language narratives such as murder mysteries, object placement and team allocation. | Sprague, Ye, Durrett et al. · 2023 | Active | See leaderboard |
Mathematics
From grade-school word problems to research-level proofs. The older sets are saturated; the newest are held back from the public to stay contamination-resistant.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| AIME 2025 | Exact-match accuracy, usually pass@1 averaged over samples | Olympiad-track competition mathematics at the level of the American Invitational Mathematics Examination, used as a high-difficulty LLM eval. | Mathematical Association of America; adopted as an LLM eval by the community · 2025 | Saturated | 100% · Multiple frontier reasoning models |
| FrontierMath | Accuracy (fraction with a correct, automatically verifiable final answer) | Research-level original mathematics requiring hours to days of expert effort, across number theory, analysis, algebraic geometry and more. | Epoch AI · 2024 | Active | 52.4% · GPT-5.5 Pro |
| GSM8K | Exact-match accuracy on the final numeric answer | Multi-step grade-school arithmetic word-problem reasoning. | OpenAI · 2021 | Saturated | ~99.6% · Frontier models broadly |
| MATH | Exact-match accuracy on the final boxed answer | Step-by-step solving of high-school competition mathematics across algebra, geometry, number theory, probability and precalculus. | Hendrycks et al. · 2021 | Saturated | ~99% (MATH-500) · GPT-5 |
| MathArena | Per-competition accuracy and an aggregate expected-performance score | Mathematical reasoning and proof-writing on freshly released competition problems, evaluated before they can enter training data. | ETH Zurich · 2025 | Active | 81.1% (aggregate) · GPT-5.5 (xhigh) |
| Omni-MATH | Accuracy, scored with an LLM-based verifier (Omni-Judge) | Olympiad-level mathematical reasoning across a broad range of subdomains and difficulty levels. | Gao, Song, Cai et al. · 2024 | Active | See leaderboard |
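
The MATH row scores exact match on the final boxed answer. As a minimal sketch of that idea, the code below pulls the last LaTeX `\boxed{...}` span out of a model's output and compares it with the reference. It assumes answers are wrapped in `\boxed{}` and normalizes only by removing spaces; real graders typically apply much more normalization, and the helper names and sample outputs here are hypothetical.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def is_correct(model_output: str, reference: str) -> bool:
    """Exact match after stripping spaces (a deliberately crude normalizer)."""
    answer = last_boxed(model_output)
    return answer is not None and answer.replace(" ", "") == reference.replace(" ", "")


# Hypothetical model outputs and gold answers.
outputs = [
    r"so the probability is \boxed{\frac{1}{2}}",
    r"I think the answer is 7",             # no boxed answer: counted wrong
    r"first \boxed{3}, corrected to \boxed{ 4 }",
]
references = [r"\frac{1}{2}", "7", "4"]

accuracy = sum(is_correct(o, r) for o, r in zip(outputs, references)) / len(outputs)
print(f"accuracy = {accuracy:.3f}")
```

The second output shows why exact-match scoring is strict: a correct number that is not presented in the expected format still counts as a miss.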
Knowledge & general QA
Broad academic and factual knowledge across domains, usually multiple-choice. The most-quoted and most-saturated family, now largely replaced by harder variants.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| MMLU | Accuracy | Broad academic and professional knowledge across 57 subjects via four-choice multiple-choice questions. | Hendrycks et al. · 2021 | Saturated | ~93% · Qwen3.7 Max |
| MMLU-Pro | Accuracy | Harder multi-task reasoning and knowledge designed to de-saturate MMLU and reward deliberate reasoning over recall. | TIGER-Lab · 2024 | Active | ~90% · Gemini 3 Pro Preview |
| MMLU-Redux | Accuracy on cleaned labels | A re-annotated, error-corrected subset of MMLU used to measure true knowledge accuracy without the original's label noise. | Gema et al. · 2024 | Active | See leaderboard |
| SimpleQA | Accuracy, plus correct-given-attempted and an F-score balancing attempts against accuracy | Short-form parametric factuality: whether a model answers single-answer fact-seeking questions correctly and abstains when unsure. | OpenAI · 2024 | Active | See leaderboard |
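
The SimpleQA row combines overall accuracy, correct-given-attempted and an F-score. A minimal sketch of how those three numbers could relate, assuming each answer is graded as correct, incorrect or not attempted and that the F-score is the harmonic mean of the other two figures. The function name and the counts are hypothetical; the benchmark's own definition is the authority.

```python
def simpleqa_scores(correct: int, incorrect: int, not_attempted: int) -> dict:
    """Summarize graded answers into accuracy, precision-like and F-score figures."""
    total = correct + incorrect + not_attempted
    if total == 0:
        raise ValueError("no graded answers")
    attempted = correct + incorrect
    overall = correct / total                                  # accuracy over all questions
    given_attempted = correct / attempted if attempted else 0.0
    denom = overall + given_attempted
    f_score = 2 * overall * given_attempted / denom if denom else 0.0
    return {
        "overall_correct": overall,
        "correct_given_attempted": given_attempted,
        "f_score": f_score,
    }


# Hypothetical models on 100 questions.
cautious = simpleqa_scores(correct=40, incorrect=20, not_attempted=40)
guesser = simpleqa_scores(correct=45, incorrect=55, not_attempted=0)
print(cautious)
print(guesser)
```

In this toy comparison the cautious model answers fewer questions correctly overall but earns the higher F-score, because abstaining when unsure protects its correct-given-attempted rate.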
Long context & retrieval
Whether a model can actually use a very long input, not just accept it: finding facts, resolving references, and reasoning across hundreds of thousands of tokens.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| LongBench | v1: per-task automatic metrics. v2: multiple-choice accuracy | Comprehensive long-context understanding across realistic tasks (QA, summarization, few-shot, code, synthetic) in English and Chinese. | Tsinghua University · 2023 | Active | 57.7% (v2, with reasoning) · o1-preview |
| MRCR | Similarity of the model output to the target instance, gated by a required answer-prefix | Whether a model can distinguish and retrieve the correct one among multiple near-identical requests buried in a long multi-turn conversation. | Google DeepMind (Michelangelo); open-source variant by OpenAI · 2024 | Active | See leaderboard |
| Needle-in-a-Haystack | Retrieval accuracy at each depth and length cell | Whether a model can recall a single planted fact (the needle) inserted at varying depths within a long context (the haystack). | Greg Kamradt · 2023 | Saturated | See leaderboard |
| NoLiMa | Accuracy at each length, relative to the model's short-context baseline | Long-context retrieval and reasoning when the question and the target fact share minimal literal word overlap, forcing latent association rather than keyword matching. | Adobe Research and LMU Munich · 2025 | Active | See leaderboard |
| RULER | Weighted-average accuracy across tasks and lengths; effective length is the longest length still above threshold | The real effective context length of a model by testing retrieval, multi-hop tracing, aggregation and QA at increasing sequence lengths. | NVIDIA · 2024 | Active | See leaderboard |
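
The RULER row describes effective length as the longest length still above a threshold. The sketch below implements one reading of that rule: walk the tested lengths from shortest to longest and stop at the first one that falls below the threshold. The function name, the accuracy figures and the 0.85 threshold are all hypothetical placeholders, not values taken from the benchmark.

```python
from typing import Mapping


def effective_length(accuracy_by_length: Mapping[int, float], threshold: float) -> int:
    """Longest tested length such that it and every shorter tested
    length score at or above the threshold; 0 if none qualifies."""
    best = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break  # performance has dropped; longer lengths no longer count
        best = length
    return best


# Hypothetical accuracies for a model advertised with a 65,536-token window.
scores = {4096: 0.95, 8192: 0.93, 16384: 0.90, 32768: 0.84, 65536: 0.70}

print(effective_length(scores, threshold=0.85))
```

Under these made-up numbers the model accepts 65,536 tokens but its effective length is 16,384, which is the gap between claimed and usable context that this kind of benchmark is built to measure.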
Multimodal & vision
Reasoning over images, charts, documents, and video alongside text. The frontier for models that see, not just read.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| MathVista | Accuracy | Mathematical and quantitative reasoning grounded in visual contexts such as figures, charts, geometry and scientific diagrams. | Lu et al. · 2023 | Active | ~91% (testmini) · Seed 2.1 Pro |
| MMMU | Accuracy | College-level multimodal understanding and reasoning over images, diagrams, charts and text across many disciplines. | MMMU team · 2023 | Active | ~86% · Qwen3.6 Plus |
| MMMU-Pro | Accuracy | A harder, contamination-resistant version of MMMU that forces genuine visual reasoning rather than text-only shortcuts. | MMMU team · 2024 | Active | ~84% · Gemini 3.5 Flash |
| Video-MME | Accuracy (tested with and without subtitles) | Comprehensive video understanding by multimodal LLMs across short, medium and long clips. | MME-Benchmarks team · 2024 | Active | ~89% · Seed 2.1 Pro |
Human preference & holistic
Aggregate and head-to-head measures: human-voted arenas, composite indices, and multi-metric frameworks that rank overall capability rather than one skill.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| Artificial Analysis Intelligence Index | Composite index score (0 to 100 aggregate) | A composite index of overall model intelligence aggregating performance across reasoning, coding, knowledge, science and agentic tasks. | Artificial Analysis · 2024 | Active | ~60 (index) · Claude Fable 5 |
| HELM | Multi-metric (per-metric scores across scenarios; no single headline number) | Multi-metric holistic evaluation across many scenarios, reporting accuracy alongside calibration, robustness, fairness, bias, toxicity and efficiency. | Stanford CRFM · 2022 | Active | See leaderboard |
| LMArena | Elo / Bradley-Terry pairwise rating (an Arena Score) | Crowdsourced human preference between two anonymized model responses, aggregated into a relative ranking, not an objective capability. | LMArena · 2023 | Active | ~1510 Elo · Claude Opus 4.8 |
| MT-Bench | LLM-as-judge score (1 to 10 scale, averaged) | Instruction-following and conversational quality on multi-turn prompts, scored automatically by a strong LLM judge. | LMSYS · 2023 | Saturated | See leaderboard |
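
The LMArena row reports an Elo / Bradley-Terry pairwise rating. To show what a rating gap means in practice, here is a minimal sketch of the standard Elo expected-score formula and a single online update. It is an illustration only: a production arena may fit a Bradley-Terry model over all votes at once instead of updating ratings one vote at a time, and the K-factor of 32 and the sample ratings are hypothetical.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Apply one vote. score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical pair of models 100 rating points apart.
strong, weak = 1510.0, 1410.0
print(f"expected win rate of the stronger model: {expected_score(strong, weak):.3f}")

# An upset (the weaker model wins the vote) moves both ratings toward each other.
new_strong, new_weak = elo_update(strong, weak, score_a=0.0)
print(f"after upset: {new_strong:.1f} vs {new_weak:.1f}")
```

Because the score depends only on rating differences, an arena rating ranks models relative to each other and says nothing about absolute capability, which matches the caveat in the table.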
Safety, hallucination & factuality
Whether a model tells the truth and resists making things up. Measures honesty and hallucination rate, not raw capability.
| Benchmark | How it is scored | What it measures | Maker · Year | Status | Top score |
|---|---|---|---|---|---|
| HaluEval | Hallucination-recognition accuracy (faithful vs hallucinated) | A model's ability to recognize hallucinated content across question answering, knowledge-grounded dialogue and summarization. | Li et al. · 2023 | Active | See leaderboard |
| TruthfulQA | % truthful (and % truthful-and-informative) | Whether a model avoids repeating common human misconceptions when answering questions, rather than imitating popular falsehoods. | Lin, Hilton, Evans · 2021 | Active | See leaderboard |
| Vectara Hallucination Leaderboard | Hallucination rate (% of summaries judged unfaithful; lower is better) | How often a model introduces unsupported content when summarizing a provided source document, i.e. faithfulness in source-grounded summarization. | Vectara · 2023 | Active | 1.8% (lower is better) · antgroup/finix-s1-32b |
Search and filter all 55 benchmarks
Filter by category or status, or search by name, alias, what a benchmark measures, or who built it.
| Benchmark | What it measures | Maker | Year | Status | Top score |
|---|---|---|---|---|---|
| Aider Polyglot Coding & software-engineering agents | How well a model writes and correctly edits code across many languages, including applying diffs in the right format and self-correcting after test failures. | Aider (Paul Gauthier) | 2024 | Active | See leaderboard |
| BigCodeBench Coding & software-engineering agents | Whether models can write code that correctly invokes multiple function calls from diverse real libraries to satisfy complex, practical instructions. | BigCode project (Zhuo et al.) | 2024 | Active | See leaderboard |
| DeepSWE Coding & software-engineering agents | Whether frontier coding agents can complete original, long-horizon engineering tasks written from scratch, with no upstream PR to memorize. | Datacurve | 2026 | Active | ~70% Claude Fable 5 (max) |
| FrontierCode Coding & software-engineering agents | Whether a coding agent produces a mergeable, production-quality pull request, not just one that passes tests, judged on correctness, regression safety, scope, tests and style. | Cognition (with 20+ open-source maintainers) | 2026 | Active | 13.4% (Diamond) Claude Opus 4.8 |
| HumanEval Coding & software-engineering agents | Whether a model can synthesize a single correct Python function from a docstring so that it passes the provided unit tests. | OpenAI (Chen et al.) | 2021 | Saturated | ~99% Frontier models broadly |
| LiveCodeBench Coding & software-engineering agents | Code generation and related skills (self-repair, execution, test-output prediction) on fresh competitive-programming problems, designed to be contamination-free. | UC Berkeley, MIT and Cornell (Jain, Han et al.) | 2024 | Active | See leaderboard |
| MBPP Coding & software-engineering agents | Whether a model can generate short, entry-level Python functions from a natural-language prompt that pass the provided tests. | Google Research (Austin, Odena et al.) | 2021 | Saturated | ~95%+ Frontier models broadly |
| Multi-SWE-bench Coding & software-engineering agents | Cross-language issue resolution: whether agents can resolve real GitHub issues with a passing patch across many languages beyond Python. | ByteDance (ByteDance Seed) | 2025 | Active | See leaderboard |
| RepoBench Coding & software-engineering agents | Repository-level code auto-completion: retrieving relevant cross-file context, predicting the next line, and the combined retrieval-plus-completion pipeline. | Liu, Xu and McAuley (UC San Diego) | 2023 | Active | See leaderboard |
| SWE-bench Coding & software-engineering agents | Whether a system can resolve a real GitHub issue by generating a patch that passes the repository's hidden tests. | Princeton and Stanford (Jimenez, Yang, Yao et al.) | 2023 | Saturated | See leaderboard |
| SWE-bench Multimodal Coding & software-engineering agents | Whether coding agents can resolve real GitHub issues in visual, user-facing JavaScript software where the bug or feature involves the UI. | Stanford and Princeton (Yang, Jimenez et al.) | 2024 | Active | See leaderboard |
| SWE-bench Pro Coding & software-engineering agents | Whether agents can solve long-horizon, enterprise-grade software-engineering tasks under standardized scaffolding, designed to resist contamination. | Scale AI (Scale Labs) | 2025 | Active | 59.1% (public set) GPT-5.4 (xHigh) |
| SWE-bench Verified Coding & software-engineering agents | The same real-GitHub-issue resolution task as SWE-bench, restricted to a human-validated subset where the issue is solvable and the tests are not broken. | OpenAI (with the SWE-bench authors) | 2024 | Saturated | ~95% Claude Fable 5 |
| SWE-Lancer Coding & software-engineering agents | Whether frontier models can complete real paid freelance software jobs, both coding and technical-management tasks, well enough to earn the payouts. | OpenAI (Miserendino, Patwardhan et al.) | 2025 | Active | See leaderboard |
| Terminal-Bench Coding & software-engineering agents | Whether an AI agent can complete hard, realistic command-line tasks (build, configure, train, debug, secure) end to end inside a real terminal. | Stanford and the Laude Institute | 2026 | Active | ~82% Codex (GPT-5.5) |
| AgentBench Agents, tool use & computer use | How well an LLM acts as an autonomous agent in multi-turn, open-ended decision-making across diverse interactive environments. | Tsinghua University (THUDM; Liu et al.) | 2023 | Active | See leaderboard |
| BrowseComp Agents, tool use & computer use | Whether a browsing agent can persistently navigate the open web to locate a single hard-to-find, entangled fact. | OpenAI (Wei, Sun et al.) | 2025 | Active | 51.5% OpenAI Deep Research (launch paper) |
| GAIA Agents, tool use & computer use | Whether an AI assistant can answer real-world questions that require multi-step reasoning, multiple modalities, web browsing and general tool use. | Meta AI and Hugging Face (Mialon, Fourrier et al.) | 2023 | Active | ~75% HAL agent (Claude Sonnet 4.5) |
| MLE-bench Agents, tool use & computer use | Whether an AI agent can do end-to-end machine-learning engineering (data prep, training, experimentation, submission) at the level of human Kaggle competitors. | OpenAI (Chan et al.) | 2024 | Active | 16.9% (paper baseline) o1-preview with AIDE scaffolding |
| OSWorld Agents, tool use & computer use | Whether a multimodal agent can operate a real computer (desktop apps, file I/O, multi-app workflows) to complete open-ended tasks in a live virtual machine. | XLANG Lab, University of Hong Kong (Xie et al.) | 2024 | Active | See leaderboard |
| tau-bench Agents, tool use & computer use | Whether a tool-using agent can reliably complete customer-service tasks over multi-turn conversations with a simulated user while obeying domain policies. | Sierra (Yao, Shinn, Narasimhan et al.) | 2024 | Active | See leaderboard |
| VisualWebArena Agents, tool use & computer use | Whether a multimodal agent can complete visually grounded web tasks that require interpreting images and page layout, not just text. | Carnegie Mellon University (Koh et al.) | 2024 | Active | See leaderboard |
| WebArena Agents, tool use & computer use | Whether an autonomous agent can complete long-horizon, realistic web tasks (navigation, forms, multi-step workflows) in fully functional self-hosted websites. | Carnegie Mellon University (Zhou, Xu et al.) | 2023 | Active | See leaderboard |
| ARC-AGI-1 Reasoning & abstraction | Whether a system can infer the abstract rule of a novel visual grid puzzle from a few examples and apply it to a new input. | Francois Chollet (ARC Prize Foundation) | 2019 | Active | 87.5% (high compute) OpenAI o3-preview |
| ARC-AGI-2 Reasoning & abstraction | The same fluid-intelligence test as ARC-AGI-1, but with harder, contamination-resistant tasks that stay easy for humans yet very hard for AI. | ARC Prize Foundation (Chollet et al.) | 2025 | Active | 54% (verified) Poetiq (Gemini-based solver) |
| BIG-Bench Hard Reasoning & abstraction | A suite of multi-step reasoning tasks (logic, arithmetic, algorithmic, commonsense) on which pre-2022 models trailed average human raters. | Suzgun et al. (Google Research and Stanford) | 2022 | Saturated | See leaderboard |
| GPQA Diamond Reasoning & abstraction | Graduate and PhD-level multiple-choice scientific reasoning in biology, physics and chemistry, on questions designed to be unanswerable by quick web search. | Rein et al. (NYU, Cohere, Anthropic) | 2023 | Saturated | ~94% Gemini 3.1 Pro Preview |
| Humanity's Last Exam Reasoning & abstraction | Frontier, closed-ended expert knowledge and reasoning across more than 100 academic disciplines at the limit of human expertise. | Center for AI Safety (CAIS) and Scale AI | 2025 | Active | 53.3% Claude Fable 5 (Max Effort) |
| MuSR Reasoning & abstraction | Multistep commonsense reasoning embedded in long natural-language narratives such as murder mysteries, object placement and team allocation. | Sprague, Ye, Durrett et al. (UT Austin) | 2023 | Active | See leaderboard |
| AIME 2025 Mathematics | Olympiad-track competition mathematics at the level of the American Invitational Mathematics Examination, used as a high-difficulty LLM eval. | Mathematical Association of America; adopted as an LLM eval by the community | 2025 | Saturated | 100% Multiple frontier reasoning models |
| FrontierMath Mathematics | Research-level original mathematics requiring hours to days of expert effort, across number theory, analysis, algebraic geometry and more. | Epoch AI | 2024 | Active | 52.4% GPT-5.5 Pro |
| GSM8K Mathematics | Multi-step grade-school arithmetic word-problem reasoning. | OpenAI (Cobbe et al.) | 2021 | Saturated | ~99.6% Frontier models broadly |
| MATH Mathematics | Step-by-step solving of high-school competition mathematics across algebra, geometry, number theory, probability and precalculus. | Hendrycks et al. (UC Berkeley) | 2021 | Saturated | ~99% (MATH-500) GPT-5 |
| MathArena Mathematics | Mathematical reasoning and proof-writing on freshly released competition problems, evaluated before they can enter training data. | ETH Zurich (SRI Lab) | 2025 | Active | 81.1% (aggregate) GPT-5.5 (xhigh) |
| Omni-MATH Mathematics | Olympiad-level mathematical reasoning across a broad range of subdomains and difficulty levels. | Gao, Song, Cai et al. (Peking University and collaborators) | 2024 | Active | See leaderboard |
| MMLU Knowledge & general QA | Broad academic and professional knowledge across 57 subjects via four-choice multiple-choice questions. | Hendrycks et al. (UC Berkeley and collaborators) | 2021 | Saturated | ~93% Qwen3.7 Max |
| MMLU-Pro Knowledge & general QA | Harder multi-task reasoning and knowledge designed to de-saturate MMLU and reward deliberate reasoning over recall. | TIGER-Lab (Wang et al., University of Waterloo) | 2024 | Active | ~90% Gemini 3 Pro Preview |
| MMLU-Redux Knowledge & general QA | A re-annotated, error-corrected subset of MMLU used to measure true knowledge accuracy without the original's label noise. | Gema et al. (University of Edinburgh and collaborators) | 2024 | Active | See leaderboard |
| SimpleQA Knowledge & general QA | Short-form parametric factuality: whether a model answers single-answer fact-seeking questions correctly and abstains when unsure. | OpenAI (Wei, Karina et al.) | 2024 | Active | See leaderboard |
| LongBench Long context & retrieval | Comprehensive long-context understanding across realistic tasks (QA, summarization, few-shot, code, synthetic) in English and Chinese. | Tsinghua University (THUDM; Bai et al.) | 2023 | Active | 57.7% (v2, with reasoning) o1-preview |
| MRCR Long context & retrieval | Whether a model can distinguish and retrieve the correct one among multiple near-identical requests buried in a long multi-turn conversation. | Google DeepMind (Michelangelo); open-source variant by OpenAI | 2024 | Active | See leaderboard |
| Needle-in-a-Haystack Long context & retrieval | Whether a model can recall a single planted fact (the needle) inserted at varying depths within a long context (the haystack). | Greg Kamradt (independent) | 2023 | Saturated | See leaderboard |
| NoLiMa Long context & retrieval | Long-context retrieval and reasoning when the question and the target fact share minimal literal word overlap, forcing latent association rather than keyword matching. | Adobe Research and LMU Munich (Modarressi et al.) | 2025 | Active | See leaderboard |
| RULER Long context & retrieval | The real effective context length of a model by testing retrieval, multi-hop tracing, aggregation and QA at increasing sequence lengths. | NVIDIA (Hsieh, Sun et al.) | 2024 | Active | See leaderboard |
| MathVista Multimodal & vision | Mathematical and quantitative reasoning grounded in visual contexts such as figures, charts, geometry and scientific diagrams. | Lu et al. (UCLA, University of Washington, Microsoft Research) | 2023 | Active | ~91% (testmini) Seed 2.1 Pro |
| MMMU Multimodal & vision | College-level multimodal understanding and reasoning over images, diagrams, charts and text across many disciplines. | MMMU team (Yue et al.) | 2023 | Active | ~86% Qwen3.6 Plus |
| MMMU-Pro Multimodal & vision | A harder, contamination-resistant version of MMMU that forces genuine visual reasoning rather than text-only shortcuts. | MMMU team (Yue et al.) | 2024 | Active | ~84% Gemini 3.5 Flash |
| Video-MME Multimodal & vision | Comprehensive video understanding by multimodal LLMs across short, medium and long clips. | MME-Benchmarks team (Fu et al.) | 2024 | Active | ~89% Seed 2.1 Pro |
| Artificial Analysis Intelligence Index Human preference & holistic | A composite index of overall model intelligence aggregating performance across reasoning, coding, knowledge, science and agentic tasks. | Artificial Analysis (independent) | 2024 | Active | ~60 (index) Claude Fable 5 |
| HELM Human preference & holistic | Multi-metric holistic evaluation across many scenarios, reporting accuracy alongside calibration, robustness, fairness, bias, toxicity and efficiency. | Stanford CRFM (Liang, Bommasani et al.) | 2022 | Active | See leaderboard |
| LMArena Human preference & holistic | Crowdsourced human preference between two anonymized model responses, aggregated into a relative ranking rather than an objective measure of capability. | LMArena (formerly LMSYS; Zheng, Chiang et al.) | 2023 | Active | ~1510 Elo Claude Opus 4.8 |
| MT-Bench Human preference & holistic | Instruction-following and conversational quality on multi-turn prompts, scored automatically by a strong LLM judge. | LMSYS (Zheng et al., UC Berkeley) | 2023 | Saturated | See leaderboard |
| HaluEval Safety, hallucination & factuality | A model's ability to recognize hallucinated content across question answering, knowledge-grounded dialogue and summarization. | Li et al. (Renmin University of China) | 2023 | Active | See leaderboard |
| TruthfulQA Safety, hallucination & factuality | Whether a model avoids repeating common human misconceptions when answering questions, rather than imitating popular falsehoods. | Lin, Hilton, Evans (Oxford and OpenAI) | 2021 | Active | See leaderboard |
| Vectara Hallucination Leaderboard Safety, hallucination & factuality | How often a model introduces unsupported content when summarizing a provided source document, i.e. faithfulness in source-grounded summarization. | Vectara (Hughes et al.) | 2023 | Active | 1.8% (lower is better) antgroup/finix-s1-32b |
Top scores are representative snapshots as of June 2026, not live readings: leaderboards move constantly, and many figures come from each benchmark's own board (linked on the name). Where no clean current top score could be confirmed from a primary source, the cell reads "See leaderboard." Confirm at the source before quoting a number.
How to read a benchmark score
A headline score is only as good as the benchmark behind it. Four failure modes decide whether a number means anything: contamination, saturation, gameability and the gap to real-world work. The strongest benchmarks are the ones that resist all four. The scorecard below rates the most-quoted benchmarks on each concern; a higher rating always means a less trustworthy score. For the full argument and the receipts, read our guide on whether AI benchmarks are reliable and our breakdown of benchmark contamination.
| Benchmark | Contamination | Saturation | Gameability | Real-world gap |
|---|---|---|---|---|
| MMLU (knowledge MCQ) | High | High | High | High |
| SWE-bench Verified (real GitHub fixes) | High | High | High | Medium |
| LMArena (human preference) | Medium | Low | High | High |
| GPQA Diamond (PhD-level science) | Medium | High | Medium | Medium |
| ARC-AGI-2 (abstract reasoning) | Low | Low | Medium | High |
| FrontierMath (research math) | Medium | Medium | Low | Medium |
| Terminal-Bench (terminal tasks) | Low | Medium | Low | Low |
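One way to compare rows in the scorecard is to turn the ratings into numbers. The sketch below is illustrative only: the 0/1/2 mapping and the unweighted sum are assumptions made for this example, not the method behind the ratings, and `risk_total` and `rank_by_risk` are hypothetical helper names.

```python
# Ratings copied from the scorecard; the tuple order is
# contamination, saturation, gameability, real-world gap.
SCORECARD = {
    "MMLU": ("High", "High", "High", "High"),
    "SWE-bench Verified": ("High", "High", "High", "Medium"),
    "LMArena": ("Medium", "Low", "High", "High"),
    "GPQA Diamond": ("Medium", "High", "Medium", "Medium"),
    "ARC-AGI-2": ("Low", "Low", "Medium", "High"),
    "FrontierMath": ("Medium", "Medium", "Low", "Medium"),
    "Terminal-Bench": ("Low", "Medium", "Low", "Low"),
}

# Assumed mapping: a higher number means a less trustworthy score,
# which matches the direction of the table.
LEVEL = {"Low": 0, "Medium": 1, "High": 2}


def risk_total(ratings):
    """Sum the four ratings into one unweighted risk figure from 0 to 8."""
    return sum(LEVEL[r] for r in ratings)


def rank_by_risk(scorecard):
    """Return (name, total) pairs sorted from lowest risk to highest."""
    pairs = ((name, risk_total(r)) for name, r in scorecard.items())
    return sorted(pairs, key=lambda pair: pair[1])


ranking = rank_by_risk(SCORECARD)
for name, total in ranking:
    print(f"{name}: {total}")
```

Under these assumptions Terminal-Bench comes out lowest-risk (1) and MMLU highest (8). A weighted sum would be a reasonable variation if one concern matters more for your decision.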
Which benchmark should you actually watch?
Pick by the decision you are making, not by whichever number a vendor leads with.
- Choosing a coding model: read a completion benchmark and a quality benchmark together. DeepSWE measures whether an agent finishes the task; FrontierCode measures whether you would merge its code, and the same model can ace one while failing the other. Add Terminal-Bench for agentic, run-it-in-a-real-terminal work.
- Judging raw reasoning: watch the unsaturated sets. ARC-AGI-2 (abstract reasoning), Humanity’s Last Exam (expert breadth) and FrontierMath (research math) are still far from solved, so movement there is real progress, not noise.
- Comparing general capability: a human-preference ranking like LMArena is the closest to "which feels better to use," but it rewards style as much as substance, so pair it with a composite index and a hard reasoning score.
- Ignore the saturated ones: MMLU, GSM8K, MATH and HumanEval are quoted out of habit. When every frontier model scores above 95%, the benchmark is measuring the ceiling, not the model.
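The saturation rule of thumb in the last bullet can be sketched as a quick check. The 95% threshold comes from the text; treating the top three scores as "the frontier" is an assumption, and the example numbers are hypothetical rather than real leaderboard readings.

```python
def is_saturated(scores, threshold=95.0, top_n=3):
    """Return True when the top_n highest scores all exceed threshold.

    scores: percentages (0-100) from different models on one benchmark.
    threshold and top_n are illustrative defaults, not a standard.
    """
    top = sorted(scores, reverse=True)[:top_n]
    return len(top) == top_n and all(s > threshold for s in top)


def top_spread(scores, top_n=3):
    """Gap in points between the best score and the top_n-th best score."""
    top = sorted(scores, reverse=True)[:top_n]
    return top[0] - top[-1]


# Hypothetical scores for illustration only.
old_benchmark = [97.1, 96.4, 95.8, 91.0]
hard_benchmark = [41.0, 33.5, 29.0, 12.0]

print(is_saturated(old_benchmark), round(top_spread(old_benchmark), 1))
print(is_saturated(hard_benchmark), round(top_spread(hard_benchmark), 1))
```

A small spread among the leaders is the practical symptom: when the top models sit within a point or two of each other near the ceiling, the ordering between them is mostly noise.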
For how these benchmarks rank the current crop of agents, see how the 2026 coding-agent benchmarks actually rank, the coding agents that wrap these models, and the value leaderboard for points-per-dollar.
Why this is a snapshot, not a live feed
Benchmark scores are the most volatile, most gamed layer of model marketing. Leaderboards re-rank weekly, labs report only the tests they win, and the same benchmark can be run under different scaffolds that move the number by ten points or more. This directory is therefore a representative map as of June 29, 2026, built to show what each benchmark means and ground it in its own source, not to quote a score you can hold anyone to. The durable value is the left side of the table (what it measures, who built it, whether it is still meaningful); the top-score column is a pointer to the live leaderboard, where you should always confirm before quoting a number.
Frequently asked questions
- What is an AI benchmark?
- An AI benchmark is a standardized test made of a fixed dataset, a task specification, and a scoring metric, used to measure and compare how well AI models perform a specific skill such as reasoning, coding, math, or knowledge. Running many models through the same test produces a single comparable score, which is what model leaderboards rank.
- What are the main AI benchmarks in 2026?
- For coding, SWE-bench Verified, DeepSWE, and Terminal-Bench. For reasoning, GPQA Diamond, ARC-AGI-2, and Humanity’s Last Exam. For math, FrontierMath and AIME. For broad knowledge, MMLU-Pro. For multimodal, MMMU. And for overall human preference, the LMArena Elo ranking. Many older benchmarks like MMLU, GSM8K, and HumanEval are now saturated and quoted mainly out of habit.
- Are AI benchmarks reliable?
- Partly. A benchmark is reliable only as far as its score reflects real capability, and four things erode that: contamination (test data leaking into training), saturation (top models bunched near the ceiling), gameability (a score inflated without real skill), and vendor cherry-picking (a lab reporting only the benchmarks it wins). The most trustworthy benchmarks are contamination-resistant, unsaturated, and run by an independent party. For the full breakdown, see our guide on whether AI benchmarks are reliable.
- What does it mean when a benchmark is saturated?
- A benchmark is saturated when the strongest models all score near its ceiling, so the differences between them are within noise and the benchmark no longer separates a better model from a worse one. MMLU, GSM8K, MATH, and HumanEval are all saturated in 2026, with top models above 95%, which is why the field keeps building harder replacements like MMLU-Pro and ARC-AGI-2.
- What is the difference between SWE-bench and SWE-bench Verified?
- SWE-bench is the original 2,294-task set of real GitHub issues. SWE-bench Verified is a 500-task subset that OpenAI and the SWE-bench authors hand-checked in 2024 to remove broken tests and unsolvable issues, so it is the cleaner, more-quoted version. By 2026 even Verified is treated as saturated and contamination-prone, and OpenAI now recommends the harder SWE-bench Pro instead.
- Which AI benchmark matters most for coding?
- There is no single one, because they measure different things. SWE-bench Verified and DeepSWE measure whether an agent can resolve a real issue; Terminal-Bench measures whether it can operate a real terminal end to end; FrontierCode measures whether the code is clean enough to merge; LiveCodeBench measures contamination-free competitive programming. For shipping production code, read a completion benchmark and a quality benchmark together rather than trusting one number.
- What is benchmark contamination?
- Contamination is when a benchmark’s questions or answers leak into a model’s training data, so the model can recall the answer instead of reasoning it out. It inflates scores without reflecting real capability and is the main reason public, static benchmarks decay over time. The defenses are private or held-out test sets, time-stamped problems released after a model’s cutoff, and freshly generated tasks.
- What is GPQA Diamond, and why is it hard?
- GPQA Diamond is a 198-question set of graduate and PhD-level biology, physics, and chemistry questions written by domain experts to be Google-proof, meaning a non-expert with web access still cannot answer them quickly. It tests reasoning over retrieval. By 2026 top models exceed the roughly 70% human-expert baseline and sit in the low-to-mid 90s, so it is now largely saturated.
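The time-stamped defense against contamination described in the answers above can be sketched as a date filter: keep only problems released after the model's training cutoff. The field names, problem IDs and dates below are hypothetical, and real benchmarks publish their own data formats.

```python
from datetime import date


def uncontaminated(problems, training_cutoff):
    """Keep problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical problem set; "released" is the public release date.
problems = [
    {"id": "p1", "released": date(2025, 3, 1)},
    {"id": "p2", "released": date(2025, 11, 15)},
    {"id": "p3", "released": date(2026, 2, 10)},
]

cutoff = date(2025, 10, 1)  # assumed training cutoff for the model under test
fresh = uncontaminated(problems, cutoff)
print([p["id"] for p in fresh])  # ['p2', 'p3']
```

The filter only helps if the cutoff is reported honestly and the release dates are accurate, which is why private or held-out test sets remain the stronger defense.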
Sources
Each benchmark is grounded in its primary source: the original paper, the project repository, or the official leaderboard, verified June 29, 2026. Top scores are representative snapshots from those sources:
- Aider Polyglot (2024). Aider (Paul Gauthier). aider.chat/2024/12/21/polyglot.html
- BigCodeBench (2024). BigCode project (Zhuo et al.). arxiv.org/abs/2406.15877
- DeepSWE (2026). Datacurve. github.com/datacurve-ai/deep-swe
- FrontierCode (2026). Cognition (with 20+ open-source maintainers). cognition.com/blog/frontier-code
- HumanEval (2021). OpenAI (Chen et al.). arxiv.org/abs/2107.03374
- LiveCodeBench (2024). UC Berkeley, MIT and Cornell (Jain, Han et al.). arxiv.org/abs/2403.07974
- MBPP (2021). Google Research (Austin, Odena et al.). arxiv.org/abs/2108.07732
- Multi-SWE-bench (2025). ByteDance (ByteDance Seed). arxiv.org/abs/2504.02605
- RepoBench (2023). Liu, Xu and McAuley (UC San Diego). arxiv.org/abs/2306.03091
- SWE-bench (2023). Princeton and Stanford (Jimenez, Yang, Yao et al.). arxiv.org/abs/2310.06770
- SWE-bench Multimodal (2024). Stanford and Princeton (Yang, Jimenez et al.). arxiv.org/abs/2410.03859
- SWE-bench Pro (2025). Scale AI (Scale Labs). arxiv.org/abs/2509.16941
- SWE-bench Verified (2024). OpenAI (with the SWE-bench authors). openai.com/index/introducing-swe-bench-verified/
- SWE-Lancer (2025). OpenAI (Miserendino, Patwardhan et al.). arxiv.org/abs/2502.12115
- Terminal-Bench (2026). Stanford and the Laude Institute. arxiv.org/abs/2601.11868
- AgentBench (2023). Tsinghua University (THUDM; Liu et al.). arxiv.org/abs/2308.03688
- BrowseComp (2025). OpenAI (Wei, Sun et al.). arxiv.org/abs/2504.12516
- GAIA (2023). Meta AI and Hugging Face (Mialon, Fourrier et al.). arxiv.org/abs/2311.12983
- MLE-bench (2024). OpenAI (Chan et al.). arxiv.org/abs/2410.07095
- OSWorld (2024). XLANG Lab, University of Hong Kong (Xie et al.). arxiv.org/abs/2404.07972
- tau-bench (2024). Sierra (Yao, Shinn, Narasimhan et al.). arxiv.org/abs/2406.12045
- VisualWebArena (2024). Carnegie Mellon University (Koh et al.). arxiv.org/abs/2401.13649
- WebArena (2023). Carnegie Mellon University (Zhou, Xu et al.). arxiv.org/abs/2307.13854
- ARC-AGI-1 (2019). François Chollet (ARC Prize Foundation). arcprize.org/arc-agi/1
- ARC-AGI-2 (2025). ARC Prize Foundation (Chollet et al.). arcprize.org/arc-agi/2
- BIG-Bench Hard (2022). Suzgun et al. (Google Research and Stanford). github.com/suzgunmirac/BIG-Bench-Hard
- GPQA Diamond (2023). Rein et al. (NYU, Cohere, Anthropic). arxiv.org/abs/2311.12022
- Humanity's Last Exam (2025). Center for AI Safety (CAIS) and Scale AI. arxiv.org/abs/2501.14249
- MuSR (2023). Sprague, Ye, Durrett et al. (UT Austin). arxiv.org/abs/2310.16049
- AIME 2025 (2025). Mathematical Association of America; adopted as an LLM eval by the community. matharena.ai/
- FrontierMath (2024). Epoch AI. epoch.ai/frontiermath
- GSM8K (2021). OpenAI (Cobbe et al.). arxiv.org/abs/2110.14168
- MATH (2021). Hendrycks et al. (UC Berkeley). arxiv.org/abs/2103.03874
- MathArena (2025). ETH Zurich (SRI Lab). arxiv.org/abs/2505.23281
- Omni-MATH (2024). Gao, Song, Cai et al. (Peking University and collaborators). arxiv.org/abs/2410.07985
- MMLU (2021). Hendrycks et al. (UC Berkeley and collaborators). arxiv.org/abs/2009.03300
- MMLU-Pro (2024). TIGER-Lab (Wang et al., University of Waterloo). arxiv.org/abs/2406.01574
- MMLU-Redux (2024). Gema et al. (University of Edinburgh and collaborators). arxiv.org/abs/2406.04127
- SimpleQA (2024). OpenAI (Wei, Karina et al.). arxiv.org/abs/2411.04368
- LongBench (2023). Tsinghua University (THUDM; Bai et al.). arxiv.org/abs/2412.15204
- MRCR (2024). Google DeepMind (Michelangelo); open-source variant by OpenAI. arxiv.org/abs/2409.12640
- Needle-in-a-Haystack (2023). Greg Kamradt (independent). github.com/gkamradt/LLMTest_NeedleInAHaystack
- NoLiMa (2025). Adobe Research and LMU Munich (Modarressi et al.). arxiv.org/abs/2502.05167
- RULER (2024). NVIDIA (Hsieh, Sun et al.). arxiv.org/abs/2404.06654
- MathVista (2023). Lu et al. (UCLA, University of Washington, Microsoft Research). arxiv.org/abs/2310.02255
- MMMU (2023). MMMU team (Yue et al.). arxiv.org/abs/2311.16502
- MMMU-Pro (2024). MMMU team (Yue et al.). arxiv.org/abs/2409.02813
- Video-MME (2024). MME-Benchmarks team (Fu et al.). arxiv.org/abs/2405.21075
- Artificial Analysis Intelligence Index (2024). Artificial Analysis (independent). artificialanalysis.ai/methodology/intelligence-benchmarking
- HELM (2022). Stanford CRFM (Liang, Bommasani et al.). arxiv.org/abs/2211.09110
- LMArena (2023). LMArena (formerly LMSYS; Zheng, Chiang et al.). arxiv.org/abs/2403.04132
- MT-Bench (2023). LMSYS (Zheng et al., UC Berkeley). arxiv.org/abs/2306.05685
- HaluEval (2023). Li et al. (Renmin University of China). arxiv.org/abs/2305.11747
- TruthfulQA (2021). Lin, Hilton, Evans (Oxford and OpenAI). arxiv.org/abs/2109.07958
- Vectara Hallucination Leaderboard (2023). Vectara (Hughes et al.). github.com/vectara/hallucination-leaderboard
Machine-readable data: /ai-benchmarks.json. Benchmark reliability ratings are from our benchmark trust scorecard.