Hasty Briefs

Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

18 hours ago
  • #Benchmarking
  • #Software Engineering
  • #AI Agents
  • Senior SWE-Bench treats agents like senior engineers by giving them realistic, natural-language instructions instead of over-specified requirements.
  • It introduces a validation agent that writes behavioral tests to evaluate each task, adapting those tests to the submitted solution.
  • Bug tasks are based on tricky user reports that require runtime investigation, such as reading debug logs and working through reproduction steps.
  • Scoring combines runtime correctness tests with quality metrics, so it assesses whether a solution is tasteful as well as correct.
  • Tasks are sourced from PRs in diverse repositories and involve multi-phase, multi-stack features or bugs with significant runtime investigation.
  • Instructions are naturally under-specified: their median length is 31% of the median instruction length in SWE-Bench Pro.
  • Feature tasks can span multiple services, averaging 11 files touched per task, and are long-horizon, requiring hundreds of steps.
  • A leaderboard shows top-performing models like Claude Opus 4.8 achieving a 24.0% solve rate, with frontier models failing over 75% of the time.
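
The summary says scoring combines runtime correctness tests with quality metrics and reports a solve rate, but it does not say how the two are combined. The sketch below is one possible reading, not the benchmark's actual method: it assumes a task counts as solved only when every behavioral test passes and a quality score normalized to [0, 1] clears a threshold. `TaskResult`, `is_solved`, `solve_rate`, and the 0.5 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical structure, not the benchmark's schema)."""
    task_id: str
    tests_passed: int   # behavioral tests that passed at runtime
    tests_total: int    # behavioral tests written for this task
    quality: float      # assumed quality metric, normalized to [0, 1]


def is_solved(result: TaskResult, quality_threshold: float = 0.5) -> bool:
    """Assumed rule: all behavioral tests pass AND quality clears the threshold."""
    all_pass = result.tests_total > 0 and result.tests_passed == result.tests_total
    return all_pass and result.quality >= quality_threshold


def solve_rate(results: list[TaskResult], quality_threshold: float = 0.5) -> float:
    """Fraction of tasks counted as solved; 0.0 for an empty run."""
    if not results:
        return 0.0
    solved = sum(is_solved(r, quality_threshold) for r in results)
    return solved / len(results)


if __name__ == "__main__":
    # Made-up results for illustration only, not leaderboard data.
    run = [
        TaskResult("bug-001", 5, 5, 0.9),   # correct and clean: solved
        TaskResult("bug-002", 5, 5, 0.2),   # correct but low quality: not solved
        TaskResult("feat-001", 3, 4, 0.8),  # one failing test: not solved
        TaskResult("feat-002", 0, 6, 0.7),  # fails every test: not solved
    ]
    print(f"solve rate: {solve_rate(run):.1%}")  # prints "solve rate: 25.0%"
```

Under this assumed rule, a solution that passes every test can still count as unsolved if its quality score is low, which is one way a benchmark could reward tasteful code and not only passing code.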