Why most AI coding benchmarks are misleading (COMPASS paper)
- #AI
- #code-generation
- #benchmark
- COMPASS is a multi-dimensional benchmark for evaluating code generation in large language models.
- It assesses generated code along three dimensions: correctness, efficiency, and quality (a minimal evaluation sketch follows this list).
- COMPASS consists of 50 competitive programming problems from real Codility competitions.
- It provides authentic human baselines from 393,150 submissions.
- The benchmark evaluates runtime efficiency and code quality using industry-standard analysis tools.
- An evaluation of leading models (Claude Opus 4, Gemini 2.5 Pro, O4-Mini-High) shows that high correctness scores do not guarantee efficient or maintainable code.
- COMPASS underscores the importance of measuring dimensions beyond correctness when judging real-world code generation capability.
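
To make the three-dimensional idea concrete, here is a minimal sketch of what such an evaluation loop could look like: run a candidate solution against test cases for correctness, time it as a crude efficiency proxy, and accept an externally supplied quality score (in COMPASS this kind of signal comes from analysis tooling). This is not the paper's actual harness; the function names, the toy problem, the time-budget penalty, and the pass-in quality score are all illustrative assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DimensionScores:
    correctness: float  # fraction of test cases passed
    efficiency: float   # 1.0 within the time budget, scaled down beyond it
    quality: float      # placeholder for a score produced by an analysis tool


def evaluate_candidate(
    solution: Callable[[List[int]], int],
    tests: List[Tuple[List[int], int]],
    time_budget_s: float = 1.0,
    quality_score: float = 1.0,  # assumed to come from external tooling
) -> DimensionScores:
    """Score one candidate solution on correctness, efficiency, and quality."""
    passed = 0
    elapsed = 0.0
    for args, expected in tests:
        start = time.perf_counter()
        try:
            result = solution(args)
        except Exception:
            result = None  # a crashing solution simply fails the test case
        elapsed += time.perf_counter() - start
        passed += int(result == expected)

    correctness = passed / len(tests) if tests else 0.0
    # Simple efficiency proxy: full marks within budget, linear penalty beyond it.
    efficiency = min(1.0, time_budget_s / elapsed) if elapsed > 0 else 1.0
    return DimensionScores(correctness, efficiency, quality_score)


if __name__ == "__main__":
    # Toy problem: maximum pairwise sum of a list.
    tests = [([1, 2, 3], 5), ([10, -1, 7], 17), ([0, 0], 0)]

    def candidate(xs: List[int]) -> int:
        ordered = sorted(xs, reverse=True)
        return ordered[0] + ordered[1]

    print(evaluate_candidate(candidate, tests))
```

The point of keeping the three scores separate, as the paper argues, is that a solution can score 1.0 on correctness while still being slow or hard to maintain, which a single pass/fail metric would hide.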