DeepSWE: A contamination-free benchmark for long-horizon coding agents
4 hours ago
- #Long-horizon tasks
- #AI coding agents
- #software engineering benchmark
- DeepSWE is a new software engineering benchmark built from original, long-horizon tasks and designed to be contamination-free.
- It features 113 tasks across 91 repositories in 5 programming languages: TypeScript, Go, Python, JavaScript, and Rust.
- Tasks have shorter prompts than those in existing benchmarks such as SWE-Bench Pro but require more extensive code changes (5.5x more code).
- The benchmark uses hand-written verifiers that test observable behavior, reducing false positives and false negatives.
- DeepSWE better separates the performance of frontier coding agents, showing wider gaps that align with real-world developer experiences.
- It avoids the benchmark contamination and verifier misgrading found in existing benchmarks such as SWE-Bench Pro.
- The evaluation harness mini-swe-agent is used consistently across all models to ensure fair comparisons.
- Results indicate that stronger models, like GPT-5.5, achieve higher pass rates and are more efficient in terms of tokens and cost.
- Qualitative analysis reveals distinct patterns by model family, such as Claude's tendency to forget parts of multi-part prompts and GPT's precision.
- Limitations include the use of a single harness, focus on open-source repositories, and under-representation of some languages and task types.
- Future work may involve testing with multiple harnesses, expanding the corpus, and improving verifiers for more naturalistic prompts.
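
To illustrate what a verifier that "tests observable behavior" might look like, here is a minimal Python sketch. It is not taken from DeepSWE: the `slugify` task, the test cases, and all function names are hypothetical and chosen only for illustration. The verifier calls the submitted function and compares outputs, so it accepts any implementation that behaves correctly and rejects one that does not, regardless of how the code is written.

```python
import re
from typing import Callable

# Hypothetical task spec: "add a slugify(text) helper that lowercases the
# input and joins words with single hyphens". Not an actual DeepSWE task.
CASES = [
    ("Hello World", "hello-world"),
    ("  Leading and trailing  ", "leading-and-trailing"),
    ("Already-slugged", "already-slugged"),
]


def verify_slugify(candidate: Callable[[str], str]) -> bool:
    """Return True only if the candidate produces the expected output
    for every case. A crash counts as a failure, not as a harness error."""
    try:
        return all(candidate(text) == expected for text, expected in CASES)
    except Exception:
        return False


# Two structurally different submissions that both satisfy the spec.
def slugify_regex(text: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def slugify_split(text: str) -> str:
    return "-".join(text.lower().split())


# A submission that only lowercases, so it misses the hyphenation requirement.
def slugify_incomplete(text: str) -> str:
    return text.lower()


if __name__ == "__main__":
    for fn in (slugify_regex, slugify_split, slugify_incomplete):
        print(fn.__name__, verify_slugify(fn))
```

Because the check depends only on inputs and outputs, it does not reject a correct solution for using different internals (a false negative), and it does not accept an incomplete one merely because it touched the expected code (a false positive).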