DeepSWE vs FrontierCode: Two Ways to Grade AI Code
DeepSWE grades whether an AI finishes the task. FrontierCode grades whether you would merge its code. Why the same model scores 59% and 13%.
By Capital & Compute
DeepSWE and FrontierCode are both 2026 coding-agent benchmarks built on real open-source repositories, and they disagree about almost everything. DeepSWE asks one question: did the agent’s patch make the task work? FrontierCode asks a different one: would a maintainer actually merge this pull request? The same model can ace the first and fail the second. Claude Opus 4.8 solves 59% of DeepSWE’s tasks on the first try (Datacurve leaderboard, 2026) but clears only 13.4% of FrontierCode’s hardest tier (Cognition, vendor blog, June 8 2026). That gap is the whole story, and it is what the “vibe-coding cliff” is made of.
What DeepSWE measures: can the agent finish?
DeepSWE is a benchmark from the data-labeling company Datacurve, built to measure “frontier coding agents on original, long-horizon engineering tasks.” It is 113 tasks drawn from 91 repositories across five languages (TypeScript, Go, Python, JavaScript, and Rust). Two design choices define it.
First, it is contamination-free. The tasks are written from scratch rather than scraped from existing commits or pull requests, so no model has seen the solution during pretraining. That matters because the most-quoted older benchmark, SWE-bench, leaks: OpenAI stopped reporting SWE-bench Verified in 2026 because frontier models reproduce the reference patches rather than reasoning them out. (For the full mechanism, see our breakdown of benchmark contamination.) DeepSWE’s tasks are also genuinely heavy: Datacurve reports solutions require roughly 5.5x more code than SWE-bench Pro despite prompts that are about half the length.
Second, and this is the crux, DeepSWE grades behavior, not structure. Its verifiers are hand-written to “test software behavior rather than implementation details.” The benchmark “accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure.” Scoring is pass@1: the percentage of tasks the agent solves on the first attempt. A patch that is ugly, sprawling, badly named, and touches twelve files it did not need to touch still scores a full point, as long as the tests go green.
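As a rough illustration of what pass@1 means in practice, the sketch below computes it from one pass/fail verdict per task. The task IDs and outcomes are invented for the example, and `pass_at_1` is a hypothetical helper, not part of Datacurve's tooling.

```python
from typing import Mapping


def pass_at_1(first_attempt_results: Mapping[str, bool]) -> float:
    """Return the percentage of tasks whose first attempt passed the verifier."""
    if not first_attempt_results:
        raise ValueError("need at least one task result")
    solved = sum(first_attempt_results.values())  # True counts as 1
    return 100.0 * solved / len(first_attempt_results)


# Hypothetical task IDs and outcomes, invented for illustration only.
results = {
    "task-001": True,   # behavior correct, even if the patch is sprawling
    "task-002": False,  # behavioral verifier failed
    "task-003": True,
    "task-004": True,
}

score = pass_at_1(results)
print(f"pass@1 = {score:.1f}%")
```

Note what the function never sees: file counts, naming, or diff size. Each task contributes a single boolean, which is exactly why a messy but working patch earns full credit.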
On that bar, the frontier looks strong. Claude Fable 5 leads at 70%, GPT-5.5 follows at 67%, and Opus 4.8 lands at 59%, with the best open-weight model, GLM-5.2, at 44% and Kimi K2.7-Code at 31% (Datacurve leaderboard, verified June 29 2026). Those are the numbers a vendor would put on a slide.
What FrontierCode measures: would you merge it?
FrontierCode, from Cognition (the company behind the Devin coding agent), was built to grade the thing DeepSWE deliberately ignores. Cognition calls it “the first benchmark to measure code mergeability,” and the framing is a question every engineer recognizes: “Would the maintainer actually merge this PR?”
Instead of one pass/fail check, FrontierCode grades six dimensions:
- Behavioral correctness: does the patch solve the problem?
- Regression safety: does it break existing functionality?
- Mechanical cleanliness: does it pass build, lint, and style checks?
- Test correctness: do the submitted tests actually capture the desired behavior?
- Scope: does the patch change only the files it needs to?
- Code quality: does it follow the repository’s conventions and design patterns?
The scoring is gated. A solution has to clear “blocker” criteria (hard stops like a correctness or scope violation) or it scores zero; only then does it earn weighted credit on the rubric. The tasks themselves were hand-curated by more than 20 open-source maintainers across 36 flagship repositories, spending over 40 hours building each task from repos they personally maintain. Cognition reports the result is an 81% lower false-positive rate than SWE-Bench Pro, meaning far fewer patches that pass the test but would never survive code review.
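The gate-then-rubric idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the dimension names follow the list above, but the weights, the 0-to-1 scale, and the `RubricResult` and `gated_score` names are hypothetical, since the article does not give Cognition's actual weighting.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical weights that sum to 1.0; the real rubric weights are not given here.
WEIGHTS: Dict[str, float] = {
    "behavioral_correctness": 0.30,
    "regression_safety": 0.20,
    "mechanical_cleanliness": 0.10,
    "test_correctness": 0.15,
    "scope": 0.10,
    "code_quality": 0.15,
}


@dataclass
class RubricResult:
    # Per-dimension scores on an assumed 0.0 to 1.0 scale.
    dimension_scores: Dict[str, float]
    # Names of any blocker criteria the patch failed (hard stops).
    blocker_failures: List[str] = field(default_factory=list)


def gated_score(result: RubricResult) -> float:
    """Return 0.0 if any blocker failed, otherwise the weighted rubric score."""
    if result.blocker_failures:
        return 0.0
    return sum(
        weight * result.dimension_scores.get(dimension, 0.0)
        for dimension, weight in WEIGHTS.items()
    )


perfect = {dimension: 1.0 for dimension in WEIGHTS}

# A patch that passes every gate but has mediocre code quality.
mergeable_but_rough = RubricResult({**perfect, "code_quality": 0.5})

# A patch that scores well everywhere yet trips a blocker.
blocked = RubricResult(perfect, blocker_failures=["scope violation"])

print(round(gated_score(mergeable_but_rough), 3))
print(gated_score(blocked))
```

The design point is the early return: a blocker is not a heavy penalty inside the weighted sum, it removes the patch from the rubric entirely, so strong scores on other dimensions cannot buy it back.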
| | DeepSWE | FrontierCode |
|---|---|---|
| Core question | Did it work? | Would you merge it? |
| What it grades | Observable behavior, pass / fail | Six dimensions incl. code quality |
| Scoring | pass@1, one binary verdict | Blockers gate, then weighted rubric |
| Tasks | 113 original, 91 repos, 5 langs | 150 maintainer-built, 36 repos |
| Top score | 70% (Fable 5) | 13.4% Diamond (Opus 4.8) |
| Ignores code style? | Yes, by design | No, it penalizes it |
That last row is the entire difference in two words. DeepSWE ignores style by design. FrontierCode penalizes it on purpose.
Why the scores diverge so violently
Hold one model still and switch the grading rubric, and the score falls off a shelf. Opus 4.8 goes from 59% on DeepSWE to 51.8% on FrontierCode’s full 150-task Extended set, to 34.3% on the 100-task Main set, to 13.4% on the 50 hardest Diamond tasks. Nothing about the model changed between those numbers. The benchmark, and the question it asks, did.
The same collapse shows up across the frontier. On DeepSWE the strongest models solve more than half of the tasks; on FrontierCode’s hardest tier, mergeable code is rare even for the best of them.
| Model | DeepSWE (task completion, pass@1) | FrontierCode Diamond (mergeable quality) |
|---|---|---|
| GPT-5.5 | 67% | 6.3% |
| Claude Opus 4.8 | 59% | 13.4% |
| Gemini 3.1 Pro | 12% | 4.7% |
There is a cost twist worth naming. Cognition notes that GPT-5.5 uses up to 4x fewer tokens than Opus 4.8, a better cost-to-intelligence tradeoff, yet it is GPT-5.5 that posts the steepest drop on the quality-graded tier (67% to 6.3%). The token-efficient model is not the merge-ready one. If you have been picking a coding model on price per task, that is exactly the liability a low per-task price conceals, the same theme as our look at vibe-coding cost economics.
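The "steepest drop" claim is easy to check from the table. The sketch below uses only the three rows above; the `drop` helper and the "retained" ratio are illustrative choices rather than metrics either benchmark reports, and because the two benchmarks use different task sets, the ratio is a rough comparison and not a like-for-like measurement.

```python
from typing import Dict, Tuple

# (DeepSWE pass@1 %, FrontierCode Diamond %) taken from the table above.
scores: Dict[str, Tuple[float, float]] = {
    "GPT-5.5": (67.0, 6.3),
    "Claude Opus 4.8": (59.0, 13.4),
    "Gemini 3.1 Pro": (12.0, 4.7),
}


def drop(deepswe: float, diamond: float) -> Tuple[float, float]:
    """Return (percentage points lost, fraction of the DeepSWE score retained)."""
    points_lost = deepswe - diamond
    retained = diamond / deepswe
    return points_lost, retained


for model, (deepswe, diamond) in scores.items():
    points_lost, retained = drop(deepswe, diamond)
    print(f"{model}: down {points_lost:.1f} points, retains {retained:.0%}")

# The model with the largest absolute fall between the two benchmarks.
steepest = max(scores, key=lambda model: drop(*scores[model])[0])
print(f"Steepest drop: {steepest}")
```

On either measure, points lost or fraction retained, GPT-5.5 falls furthest of the three.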
What this means for open models
This is where the Reddit thread that started this comparison lands. On DeepSWE, open weights look competitive: GLM-5.2 at 44% and Kimi K2.7-Code at 31% are credible against closed frontier models. On FrontierCode the picture darkens. Kimi K2.6, the best-scoring open model Cognition reports, manages 3.8% on Diamond, 16% on Main, and 37% on Extended, well behind Opus 4.8 at every tier. Cognition did not publish a GLM number, so its mergeability is simply unmeasured here.
The takeaway is not that open models are bad. It is that “can complete the task” and “writes code you would merge” are different skills, and the gap between them widens exactly where it costs you most: on hard, long-lived code. A model that scores well on completion but poorly on mergeability is the engine of the vibe-coding cliff, where a project sails along until it grows past the point where unmergeable, sprawling, convention-breaking code can be patched over. For the broader field, see how the 2026 coding-agent benchmarks actually rank and which coding agents wrap these models, or browse the full AI benchmarks directory for what every major test measures.
Bottom line
Read both, and read them for different things. If your question is “can this agent get a feature working end to end,” DeepSWE’s contamination-free, behavior-graded pass@1 is the cleaner signal, and the frontier looks strong. If your question is “can I let this agent touch a codebase I have to maintain,” FrontierCode’s mergeability rubric is the more honest mirror, and the scores are sobering: even the best model clears barely one in eight of the hardest tasks. The single most useful habit is to stop treating a completion score as a quality score. They measure different things, a model can win one while losing the other, and the difference between them is precisely the code you would have to clean up.
Frequently asked questions
- What is the difference between DeepSWE and FrontierCode?
- DeepSWE (from Datacurve) measures task completion: a behavioral verifier checks whether an agent's patch makes the task work, scored pass@1, and accepts any solution that works regardless of its code structure. FrontierCode (from Cognition) measures mergeability: maintainer-authored rubrics grade six dimensions including code quality, scope, regression safety, and test correctness, so a working but sloppy patch can still fail. DeepSWE asks did it work; FrontierCode asks would you merge it.
- Why does the same model score so differently on DeepSWE and FrontierCode?
- Because they grade different things. Claude Opus 4.8 scores 59% on DeepSWE but 13.4% on FrontierCode's hardest Diamond tier. Nothing about the model changes; the question does. DeepSWE rewards a patch that passes the tests. FrontierCode also requires it to be clean, in-scope, well-tested, and convention-following, which is a much higher bar that most patches miss.
- Is FrontierCode harder than DeepSWE?
- On absolute scores, yes, by a wide margin. Top models clear 60 to 70% of DeepSWE but only single digits to low teens on FrontierCode's Diamond tier. That is because FrontierCode grades code quality and mergeability, not just whether the code runs, and its tasks were hand-built by open-source maintainers spending 40+ hours each. The two are not directly comparable, though, because they use different task sets.
- Which benchmark should I trust for choosing a coding model?
- Both, for different decisions. Use DeepSWE to judge whether an agent can complete real, original engineering tasks without having seen the answers in training. Use FrontierCode to judge whether the code it produces is clean enough to merge into a codebase you maintain. A high completion score with a low mergeability score signals an agent that finishes tasks but leaves technical debt, which is the pattern behind the vibe-coding cliff.
- Is FrontierCode replacing SWE-bench?
- Not replacing, but addressing its biggest weakness. SWE-bench checks whether tests pass, and over half of its passing patches are unmergeable in practice. Separately, OpenAI stopped reporting SWE-bench Verified in 2026 because frontier models reproduce the reference patches rather than reasoning them out. Cognition reports FrontierCode has an 81% lower false-positive rate than SWE-Bench Pro because it grades mergeability directly. Both DeepSWE and FrontierCode are newer benchmarks built to fix what SWE-bench can no longer measure.
Sources
- Datacurve. (2026). DeepSWE: Measuring frontier coding agents on original, long-horizon engineering tasks [benchmark leaderboard and methodology; 113 tasks, 91 repos, pass@1, behavioral verifiers]. deepswe.datacurve.ai and github.com/datacurve-ai/deep-swe
- Cognition. (2026). Introducing FrontierCode [vendor blog, June 8 2026; six-dimension mergeability rubric, Diamond/Main/Extended tiers, 81% lower false-positive rate vs SWE-Bench Pro, 36 repos, 40+ hours per task]. cognition.com/blog/frontier-code
- OpenAI. (2026). Why we no longer evaluate SWE-bench Verified [vendor post; frontier models reproduce the reference patches]. openai.com/index/why-we-no-longer-evaluate-swe-bench-verified