Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

24 days ago

Senior SWE-Bench treats agents like senior engineers by using realistic, natural language instructions instead of over-specified requirements.
It introduces a validation agent that writes behavioral tests for each task and adapts them to the submitted solution.
Bug tasks are based on tricky user reports that require runtime investigation, such as examining debug logs and working through reproduction steps.
Scoring combines runtime correctness tests with code-quality metrics, so solutions are judged on taste as well as correctness.
Tasks are sourced from PRs in diverse repositories and involve multi-phase, multi-stack features or bugs with significant runtime investigation.
Instructions are naturally under-specified, with a median length that is 31% of SWE-Bench Pro's.
Feature tasks can span multiple services, touch an average of 11 files per task, and are long-horizon, requiring hundreds of steps.
A leaderboard shows top-performing models like Claude Opus 4.8 achieving a 24.0% solve rate, with frontier models failing over 75% of the time.
