Why most AI coding benchmarks are misleading (COMPASS paper)
- #AI
- #code-generation
- #benchmark
- COMPASS is a multi-dimensional benchmark for evaluating code generation in large language models.
- It assesses generated code along three dimensions: correctness, efficiency, and quality (a minimal evaluation sketch follows this list).
- COMPASS consists of 50 competitive programming problems from real Codility competitions.
- It provides authentic human baselines from 393,150 submissions.
- The benchmark evaluates runtime efficiency and code quality using industry-standard analysis tools.
- An evaluation of leading models (Claude Opus 4, Gemini 2.5 Pro, O4-Mini-High) shows that high correctness scores do not guarantee efficient or maintainable code.
- COMPASS underscores the importance of measuring dimensions beyond correctness when judging real-world code generation capability.
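
To make the three-dimensional idea concrete, here is a minimal sketch of what such an evaluation loop could look like: run a candidate solution against test cases for correctness, time it as a crude efficiency proxy, and accept an externally supplied quality score (in COMPASS this kind of signal comes from analysis tooling). This is not the paper's actual harness; the function names, the toy problem, the time-budget penalty, and the pass-in quality score are all illustrative assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DimensionScores:
    correctness: float  # fraction of test cases passed
    efficiency: float   # 1.0 within the time budget, scaled down beyond it
    quality: float      # placeholder for a score produced by an analysis tool


def evaluate_candidate(
    solution: Callable[[List[int]], int],
    tests: List[Tuple[List[int], int]],
    time_budget_s: float = 1.0,
    quality_score: float = 1.0,  # assumed to come from external tooling
) -> DimensionScores:
    """Score one candidate solution on correctness, efficiency, and quality."""
    passed = 0
    elapsed = 0.0
    for args, expected in tests:
        start = time.perf_counter()
        try:
            result = solution(args)
        except Exception:
            result = None  # a crashing solution simply fails the test case
        elapsed += time.perf_counter() - start
        passed += int(result == expected)

    correctness = passed / len(tests) if tests else 0.0
    # Simple efficiency proxy: full marks within budget, linear penalty beyond it.
    efficiency = min(1.0, time_budget_s / elapsed) if elapsed > 0 else 1.0
    return DimensionScores(correctness, efficiency, quality_score)


if __name__ == "__main__":
    # Toy problem: maximum pairwise sum of a list.
    tests = [([1, 2, 3], 5), ([10, -1, 7], 17), ([0, 0], 0)]

    def candidate(xs: List[int]) -> int:
        ordered = sorted(xs, reverse=True)
        return ordered[0] + ordered[1]

    print(evaluate_candidate(candidate, tests))
```

The point of keeping the three scores separate, as the paper argues, is that a solution can score 1.0 on correctness while still being slow or hard to maintain, which a single pass/fail metric would hide.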