Hasty Briefs (beta)

DeepSWE: A contamination-free benchmark for long-horizon coding agents

6 hours ago
Tags: #Long-horizon tasks, #AI coding agents, #software engineering benchmark

  • DeepSWE is a novel software engineering benchmark focusing on original, long-horizon tasks, designed to be contamination-free.
  • It features 113 tasks across 91 repositories in 5 programming languages: TypeScript, Go, Python, JavaScript, and Rust.
  • Task prompts are shorter than those in existing benchmarks such as SWE-Bench Pro, yet solving them requires about 5.5x more code.
  • The benchmark uses hand-written verifiers that test observable behavior, reducing false positives and false negatives.
  • DeepSWE better separates the performance of frontier coding agents, showing wider gaps that align with real-world developer experiences.
  • It avoids two problems found in existing benchmarks such as SWE-Bench Pro: benchmark contamination and verifier misgrading.
  • Every model is evaluated with the same harness, mini-swe-agent, so that comparisons are fair.
  • Results indicate that stronger models, such as GPT-5.5, achieve higher pass rates while using fewer tokens and costing less.
  • Qualitative analysis reveals distinct behavior patterns by model family, such as Claude forgetting parts of multi-part prompts and GPT's precision.
  • Limitations include the use of a single harness, focus on open-source repositories, and under-representation of some languages and task types.
  • Future work may involve testing with multiple harnesses, expanding the corpus, and improving verifiers for more naturalistic prompts.
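
The comparison of pass rates, tokens, and cost described above can be sketched as a small aggregation over per-run results. This is only an illustration, not the benchmark's actual scoring code: the `RunResult` record, the `summarize` function, the model names, and all numbers are hypothetical and made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunResult:
    # Hypothetical record of one agent run on one task.
    model: str       # model label (made up for this example)
    task_id: str     # task identifier (made up for this example)
    passed: bool     # verdict returned by the task's verifier
    tokens: int      # total tokens consumed by the run
    cost_usd: float  # cost of the run in US dollars


def summarize(results):
    """Aggregate pass rate, mean tokens per run, and cost per solved task by model."""
    by_model = defaultdict(list)
    for r in results:
        by_model[r.model].append(r)

    summary = {}
    for model, runs in by_model.items():
        solved = sum(r.passed for r in runs)
        total_cost = sum(r.cost_usd for r in runs)
        summary[model] = {
            "pass_rate": solved / len(runs),
            "mean_tokens": sum(r.tokens for r in runs) / len(runs),
            # A model that solves nothing has unbounded cost per solved task.
            "cost_per_solved": total_cost / solved if solved else float("inf"),
        }
    return summary


# Illustrative data only; these are not results from the benchmark.
results = [
    RunResult("model-a", "t1", True, 1000, 0.5),
    RunResult("model-a", "t2", True, 3000, 1.5),
    RunResult("model-b", "t1", True, 4000, 2.0),
    RunResult("model-b", "t2", False, 6000, 3.0),
]
summary = summarize(results)
for model, stats in summary.items():
    print(model, stats)
```

Reporting cost per solved task alongside pass rate is one way to capture the efficiency point: a model that passes more tasks with fewer tokens looks better on both axes, whereas raw cost per run alone would hide failed attempts.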