Capital & Compute
· ai· benchmarks· coding-agents· evaluation

DeepSWE vs FrontierCode: Two Ways to Grade AI Code

DeepSWE grades whether an AI finishes the task. FrontierCode grades whether you would merge its code. Why the same model scores 59% and 13%.

By Capital & Compute

DeepSWE and FrontierCode are both 2026 coding-agent benchmarks built on real open-source repositories, and they disagree about almost everything. DeepSWE asks one question: did the agent’s patch make the task work? FrontierCode asks a different one: would a maintainer actually merge this pull request? The same model can ace the first and fail the second. Claude Opus 4.8 solves 59% of DeepSWE’s tasks on the first try (Datacurve leaderboard, 2026) but clears only 13.4% of FrontierCode’s hardest tier (Cognition, vendor blog, June 8, 2026). That gap is the whole story, and it is what the “vibe-coding cliff” is made of.

What DeepSWE measures: can the agent finish?

DeepSWE is a benchmark from the data-labeling company Datacurve, built to measure “frontier coding agents on original, long-horizon engineering tasks.” It is 113 tasks drawn from 91 repositories across five languages (TypeScript, Go, Python, JavaScript, and Rust). Two design choices define it.

First, it is contamination-free. The tasks are written from scratch rather than scraped from existing commits or pull requests, so no model has seen the solution during pretraining. That matters because the most-quoted older benchmark, SWE-bench, leaks: OpenAI stopped reporting SWE-bench Verified in 2026 because frontier models reproduce the reference patches rather than reasoning them out. (For the full mechanism, see our breakdown of benchmark contamination.) DeepSWE’s tasks are also genuinely heavy: Datacurve reports solutions require roughly 5.5x more code than SWE-bench Pro despite prompts that are about half the length.

Second, and this is the crux, DeepSWE grades behavior, not structure. Its verifiers are hand-written to “test software behavior rather than implementation details.” The benchmark “accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure.” Scoring is pass@1: the percentage of tasks the agent solves on the first attempt. A patch that is ugly, sprawling, badly named, and touches twelve files it did not need to touch still scores a full point, as long as the tests go green.
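To make the scoring rule concrete, here is a minimal sketch of pass@1 as described above: one first attempt per task, one binary verdict each, reported as a percentage. The `Attempt` record and the task IDs are hypothetical and not taken from Datacurve's harness.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One agent's first attempt at one task (hypothetical record type)."""
    task_id: str
    passed: bool  # True if the behavioral verifier went green


def pass_at_1(first_attempts: list[Attempt]) -> float:
    """Percentage of tasks solved on the first attempt."""
    if not first_attempts:
        raise ValueError("need at least one task")
    solved = sum(1 for attempt in first_attempts if attempt.passed)
    return 100.0 * solved / len(first_attempts)


# Made-up results: three of four tasks pass on the first try.
results = [
    Attempt("task-a", True),
    Attempt("task-b", True),
    Attempt("task-c", False),
    Attempt("task-d", True),
]
score = pass_at_1(results)
print(f"pass@1 = {score:.1f}%")  # prints "pass@1 = 75.0%"
```

Nothing in this calculation looks at the patch itself: a clean two-line fix and a twelve-file sprawl both contribute the same `True`.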

On that bar, the frontier looks strong. Claude Fable 5 leads at 70%, GPT-5.5 follows at 67%, and Opus 4.8 lands at 59%, with the best open-weight model, GLM-5.2, at 44% and Kimi K2.7-Code at 31% (Datacurve leaderboard, verified June 29 2026). Those are the numbers a vendor would put on a slide.

What FrontierCode measures: would you merge it?

FrontierCode, from Cognition (the company behind the Devin coding agent), was built to grade the thing DeepSWE deliberately ignores. Cognition calls it “the first benchmark to measure code mergeability,” and the framing is a question every engineer recognizes: “Would the maintainer actually merge this PR?”

Instead of one pass/fail check, FrontierCode grades six dimensions:

  • Behavioral correctness: does the patch solve the problem?
  • Regression safety: does it break existing functionality?
  • Mechanical cleanliness: does it pass build, lint, and style checks?
  • Test correctness: do the submitted tests actually capture the desired behavior?
  • Scope: does the patch change only the files it needs to?
  • Code quality: does it follow the repository’s conventions and design patterns?

The scoring is gated. A solution has to clear “blocker” criteria (hard stops like a correctness or scope violation) or it scores zero; only then does it earn weighted credit on the rubric. The tasks themselves were hand-curated by more than 20 open-source maintainers across 36 flagship repositories, spending over 40 hours building each task from repos they personally maintain. Cognition reports the result is an 81% lower false-positive rate than SWE-Bench Pro, meaning far fewer patches that pass the test but would never survive code review.
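The gate-then-rubric idea can be sketched in a few lines. The dimension names come from the list above, but the weights, the 0-to-1 rating scale, and the choice of correctness and scope as the blocker dimensions are assumptions made for illustration. Cognition's actual rubric may differ on all three.

```python
# Dimension names follow the six listed above. The weights are ASSUMED,
# not published values.
WEIGHTS = {
    "behavioral_correctness": 0.30,
    "regression_safety": 0.20,
    "mechanical_cleanliness": 0.10,
    "test_correctness": 0.15,
    "scope": 0.10,
    "code_quality": 0.15,
}
# ASSUMPTION: these two dimensions act as hard-stop blockers.
BLOCKERS = ("behavioral_correctness", "scope")


def gated_score(ratings: dict[str, float]) -> float:
    """Score a patch whose ratings map each dimension to a value in [0, 1]."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    # Gate: any blocker that is not fully satisfied zeroes the whole score.
    if any(ratings[dim] < 1.0 for dim in BLOCKERS):
        return 0.0
    # Past the gate, the patch earns weighted credit across all dimensions.
    return sum(weight * ratings[dim] for dim, weight in WEIGHTS.items())


# A patch that works and stays in scope but is sloppy elsewhere.
sloppy = {
    "behavioral_correctness": 1.0,
    "regression_safety": 1.0,
    "mechanical_cleanliness": 0.5,
    "test_correctness": 0.5,
    "scope": 1.0,
    "code_quality": 0.0,
}
# The same patch, except it also touches files it did not need to.
out_of_scope = {**sloppy, "scope": 0.0}

print(f"sloppy: {gated_score(sloppy):.3f}")
print(f"out of scope: {gated_score(out_of_scope):.3f}")
```

Under these assumed weights the sloppy patch keeps partial credit, while the out-of-scope variant scores zero however well it does on the other five dimensions.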

| | DeepSWE (Datacurve · task completion) | FrontierCode (Cognition · code mergeability) |
|---|---|---|
| Core question | Did it work? | Would you merge it? |
| What it grades | Observable behavior, pass / fail | Six dimensions incl. code quality |
| Scoring | pass@1, one binary verdict | Blockers gate, then weighted rubric |
| Tasks | 113 original, 91 repos, 5 langs | 150 maintainer-built, 36 repos |
| Top score | 70% (Fable 5) | 13.4% Diamond (Opus 4.8) |
| Ignores code style? | Yes, by design | No, it penalizes it |

That last row is the entire difference in two words. DeepSWE ignores style by design. FrontierCode penalizes it on purpose.

Why the scores diverge so violently

Hold one model still and switch the grading rubric, and the score falls off a shelf. Opus 4.8 goes from 59% on DeepSWE to 51.8% on FrontierCode’s full 150-task Extended set, to 34.3% on the 100-task Main set, to 13.4% on the 50 hardest Diamond tasks. Nothing about the model changed between those numbers. Only the question being asked did.

  • 59%: Opus 4.8 on DeepSWE (pass@1: did the patch make the task work?)
  • 13.4%: Opus 4.8 on FrontierCode (Diamond), the best score on the 50 hardest mergeability-graded tasks
  • 6: quality dimensions FrontierCode grades, vs one pass / fail check on DeepSWE
  • 81%: lower false-positive rate, FrontierCode vs SWE-Bench Pro, per Cognition

The same collapse shows up across the frontier. On DeepSWE the top models complete well over half the tasks; on FrontierCode’s hardest tier, mergeable code is rare even for the best of them.

Figure: DeepSWE completion score vs FrontierCode mergeability score, same models

| Model | DeepSWE (task completion, pass@1) | FrontierCode Diamond (mergeable quality) |
|---|---|---|
| GPT-5.5 | 67% | 6.3% |
| Claude Opus 4.8 | 59% | 13.4% |
| Gemini 3.1 Pro | 12% | 4.7% |

The same models, graded two ways. DeepSWE pass@1 asks whether the patch worked; FrontierCode's Diamond tier asks whether a maintainer would merge it. These are different benchmarks on different task sets, and Diamond is FrontierCode's hardest 50 of 150 tasks, so the gap mixes raw difficulty with the quality bar. Even on FrontierCode's easiest Extended tier the ceiling stays low: Opus 4.8 tops out at 51.8%. Directional, not a controlled head-to-head.

Source: Datacurve DeepSWE leaderboard and Cognition FrontierCode, verified June 29 2026

There is a cost twist worth naming. Cognition notes that GPT-5.5 uses up to 4x fewer tokens than Opus 4.8, a better cost-to-intelligence tradeoff, yet it is GPT-5.5 that posts the steepest drop on the quality-graded tier (67% to 6.3%). The token-efficient model is not the merge-ready one. If you have been picking a coding model on price per task, that is exactly the kind of liability the cheaper number hides, the same theme as our look at vibe-coding cost economics.
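The "steepest drop" claim can be checked with simple arithmetic on the scores quoted in this article. The sketch below computes, for each model, the drop in percentage points and the share of its DeepSWE score that survives on Diamond. Because the two benchmarks use different task sets, treat the output as directional. The helper name `drop` is illustrative.

```python
# (DeepSWE pass@1, FrontierCode Diamond) in percent, as quoted in this article.
SCORES = {
    "GPT-5.5": (67.0, 6.3),
    "Claude Opus 4.8": (59.0, 13.4),
    "Gemini 3.1 Pro": (12.0, 4.7),
}


def drop(deepswe: float, diamond: float) -> tuple[float, float]:
    """Return (percentage points lost, fraction of the DeepSWE score retained)."""
    return deepswe - diamond, diamond / deepswe


for model, (deepswe, diamond) in SCORES.items():
    points, retained = drop(deepswe, diamond)
    print(f"{model}: down {points:.1f} points, retains {retained:.1%}")

# The model that loses the most percentage points between the two benchmarks.
steepest = max(SCORES, key=lambda model: drop(*SCORES[model])[0])
print(f"steepest drop: {steepest}")
```

On these figures GPT-5.5 loses the most points (60.7) and retains the smallest share of its completion score, which matches the claim in the paragraph above.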

What this means for open models

This is where the Reddit thread that started this comparison lands. On DeepSWE, open weights look competitive: GLM-5.2 at 44% and Kimi K2.7-Code at 31% are credible against closed frontier models. On FrontierCode the picture darkens. Kimi K2.6, the best-scoring open model Cognition reports, manages 3.8% on Diamond, 16% on Main, and 37% on Extended, well behind Opus 4.8 at every tier. Cognition did not publish a GLM number, so its mergeability is simply unmeasured here.

The takeaway is not that open models are bad. It is that “can complete the task” and “writes code you would merge” are different skills, and the gap between them widens exactly where it costs you most: on hard, long-lived code. A model that scores well on completion but poorly on mergeability is the engine of the vibe-coding cliff, where a project sails along until it grows past the point where unmergeable, sprawling, convention-breaking code can be patched over. For the broader field, see how the 2026 coding-agent benchmarks actually rank and which coding agents wrap these models, or browse the full AI benchmarks directory for what every major test measures.

Bottom line

Read both, and read them for different things. If your question is “can this agent get a feature working end to end,” DeepSWE’s contamination-free, behavior-graded pass@1 is the cleaner signal, and the frontier looks strong. If your question is “can I let this agent touch a codebase I have to maintain,” FrontierCode’s mergeability rubric is the more honest mirror, and the scores are sobering: even the best model clears barely one in eight of the hardest tasks. The single most useful habit is to stop treating a completion score as a quality score. They measure different things, a model can win one while losing the other, and the difference between them is precisely the code you would have to clean up.

Frequently asked questions

What is the difference between DeepSWE and FrontierCode?
DeepSWE (from Datacurve) measures task completion: a behavioral verifier checks whether an agent's patch makes the task work, scored pass@1, and accepts any solution that works regardless of its code structure. FrontierCode (from Cognition) measures mergeability: maintainer-authored rubrics grade six dimensions including code quality, scope, regression safety, and test correctness, so a working but sloppy patch can still fail. DeepSWE asks did it work; FrontierCode asks would you merge it.
Why does the same model score so differently on DeepSWE and FrontierCode?
Because they grade different things. Claude Opus 4.8 scores 59% on DeepSWE but 13.4% on FrontierCode's hardest Diamond tier. Nothing about the model changes; the question does. DeepSWE rewards a patch that passes the tests. FrontierCode also requires it to be clean, in-scope, well-tested, and convention-following, which is a much higher bar that most patches miss.
Is FrontierCode harder than DeepSWE?
On absolute scores, yes, by a wide margin. Top models clear 60 to 70% of DeepSWE but only single digits to low teens on FrontierCode's Diamond tier. That is because FrontierCode grades code quality and mergeability, not just whether the code runs, and its tasks were hand-built by open-source maintainers spending 40+ hours each. The two are not directly comparable, though, because they use different task sets.
Which benchmark should I trust for choosing a coding model?
Both, for different decisions. Use DeepSWE to judge whether an agent can complete real, original engineering tasks without having seen the answers in training. Use FrontierCode to judge whether the code it produces is clean enough to merge into a codebase you maintain. A high completion score with a low mergeability score signals an agent that finishes tasks but leaves technical debt, which is the pattern behind the vibe-coding cliff.
Is FrontierCode replacing SWE-bench?
Not replacing, but addressing its biggest weaknesses. SWE-bench checks whether tests pass, and over half of its passing patches are unmergeable in practice. It also leaks: OpenAI stopped reporting SWE-bench Verified in 2026 because frontier models reproduce the reference patches rather than reasoning them out. Cognition reports FrontierCode has an 81% lower false-positive rate than SWE-Bench Pro because it grades mergeability directly. Both DeepSWE and FrontierCode are newer benchmarks built to fix what SWE-bench can no longer measure.
