Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
17 hours ago
- #Benchmarking
- #Software Engineering
- #AI Agents
- Senior SWE-Bench treats agents like senior engineers by using realistic, natural language instructions instead of over-specified requirements.
- It introduces a validation agent that writes behavioral tests to evaluate each task, adapting those tests to the submitted solution.
- Bug tasks are based on tricky user reports that require runtime investigation, such as working through debugging logs and reproduction steps.
- Scoring combines runtime correctness tests with quality metrics, so it rewards solutions that are tasteful as well as correct.
- Tasks are sourced from PRs in diverse repositories and involve multi-phase, multi-stack features or bugs that require significant runtime investigation.
- Instructions are naturally under-specified: their median length is 31% of the median instruction length in SWE-Bench Pro.
- Feature tasks can span multiple services, averaging 11 files touched per task, and are long-horizon, requiring hundreds of steps.
- A leaderboard shows top-performing models like Claude Opus 4.8 achieving a 24.0% solve rate, with frontier models failing over 75% of the time.
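
The points above say that scoring combines runtime correctness tests with quality metrics, and that results are reported as a solve rate. The summary does not say how the two signals are combined, so the following is only a minimal sketch under stated assumptions: the `TaskResult` record, its field names, the `[0, 1]` quality scale, and the rule "tests pass and quality clears a threshold" are all hypothetical and not taken from the benchmark. The sample data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record; field names are illustrative, not from the benchmark.
    task_id: str
    tests_passed: bool  # did the behavioral tests pass at runtime?
    quality: float      # assumed quality score in [0, 1]


def is_solved(result: TaskResult, quality_threshold: float = 0.5) -> bool:
    """Assumed rule: solved only if tests pass AND quality meets the threshold."""
    return result.tests_passed and result.quality >= quality_threshold


def solve_rate(results: list[TaskResult], quality_threshold: float = 0.5) -> float:
    """Fraction of tasks counted as solved; 0.0 for an empty result list."""
    if not results:
        return 0.0
    solved = sum(is_solved(r, quality_threshold) for r in results)
    return solved / len(results)


# Invented sample data: one clean solve, one correct but low-quality
# solution, and two runtime failures.
results = [
    TaskResult("task-a", tests_passed=True, quality=0.9),
    TaskResult("task-b", tests_passed=True, quality=0.3),
    TaskResult("task-c", tests_passed=False, quality=0.8),
    TaskResult("task-d", tests_passed=False, quality=0.2),
]

rate = solve_rate(results)
print(f"solve rate: {rate:.1%}, failure rate: {1 - rate:.1%}")
# prints "solve rate: 25.0%, failure rate: 75.0%"
```

Under this assumed rule, a solution that passes its tests but scores poorly on quality (`task-b`) does not count as solved. The same arithmetic relates the two leaderboard figures: a 24.0% solve rate corresponds to a 76.0% failure rate, which matches "failing over 75% of the time".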