FrontierCode

6 hours ago

FrontierCode is a new benchmark to measure AI model code quality, focusing on mergeability into production codebases.
It emphasizes criteria like correctness, test quality, scope, style, and codebase standards, using novel grading methods including unit tests, rubrics, and verifiers.
The benchmark was crafted by over 20 world-class open-source maintainers, with each task requiring 40+ hours of effort and manual review by Cognition researchers.
FrontierCode reduces misclassification errors by 81% compared to SWE-Bench Pro, offering a more accurate ranking of models.
Results show that models still struggle with code quality: Claude Opus 4.8 leads with 13.4% on Diamond (the hardest subset), while GPT-5.5 uses fewer tokens, making it more cost-efficient.
Tasks use concise prompts, span diverse languages, and focus on quality rubrics over patch size, making them harder than those in existing benchmarks.
Evaluation axes include behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality.
Novel grading techniques include reverse-classical tests, code scope checks, and adaptive classical grading to handle open-ended solutions.
A rigorous quality control process involves design, hack reports, rubric calibration, and multi-stage reviews to ensure reliability.
The benchmark aims to push AI coding forward by assessing production readiness, though tasks are not publicly released to avoid contamination.
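
To make the six evaluation axes listed above concrete, here is a minimal Python sketch of how a per-patch grading record might be represented. This is not FrontierCode's actual grader: the class name `AxisResults`, its methods, and the rule that a patch counts as mergeable only when every axis passes are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AxisResults:
    """Hypothetical pass/fail record, one field per axis named in the summary."""

    behavioral_correctness: bool
    regression_safety: bool
    mechanical_cleanliness: bool
    test_correctness: bool
    scope: bool
    code_quality: bool

    def failed_axes(self) -> list[str]:
        # Names of the axes the patch did not pass, in declaration order.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_mergeable(self) -> bool:
        # Assumption: mergeable means every axis passed (strict all-or-nothing).
        return not self.failed_axes()


# Example: a patch that is correct and clean but touches unrelated code.
result = AxisResults(
    behavioral_correctness=True,
    regression_safety=True,
    mechanical_cleanliness=True,
    test_correctness=True,
    scope=False,
    code_quality=True,
)
print(result.failed_axes())
print(result.is_mergeable())
```

An all-or-nothing rule is only one possible aggregation; a weighted or rubric-based score would fit the same record, and the summary does not say which the benchmark uses.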

Hasty Briefs (beta)