DeepSWE: A contamination-free benchmark for long-horizon coding agents

DeepSWE is a novel software engineering benchmark focusing on original, long-horizon tasks, designed to be contamination-free.
It features 113 tasks across 91 repositories in 5 programming languages: TypeScript, Go, Python, JavaScript, and Rust.
Tasks have shorter prompts than those in existing benchmarks such as SWE-Bench Pro, yet they require far larger code changes (5.5x more code).
The benchmark uses hand-written verifiers that test observable behavior, reducing false positives and false negatives.
DeepSWE separates frontier coding agents more clearly than existing benchmarks do, showing wider performance gaps that match real-world developer experience.
It avoids problems found in existing benchmarks such as SWE-Bench Pro, including benchmark contamination and verifier misgrading.
All models are evaluated with the same harness, mini-swe-agent, to keep comparisons fair.
Results indicate that stronger models, like GPT-5.5, achieve higher pass rates and are more efficient in terms of tokens and cost.
Qualitative analysis reveals distinct patterns by model family, such as Claude's tendency to forget parts of multi-part prompts and GPT's precision.
Limitations include the use of a single harness, focus on open-source repositories, and under-representation of some languages and task types.
Future work may involve testing with multiple harnesses, expanding the corpus, and improving verifiers for more naturalistic prompts.
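
The point about verifiers that test observable behavior can be illustrated with a small sketch. This is not DeepSWE's actual verifier code. The task (a `slugify` function), the verifier `verify_slugify`, and the candidate solutions are all hypothetical, chosen only to show why a behavior-based check accepts structurally different correct solutions (fewer false negatives) while still rejecting an incorrect one (fewer false positives).

```python
import re
from typing import Callable


def verify_slugify(slugify: Callable[[str], str]) -> bool:
    """Hypothetical verifier: checks only input/output behavior,
    never how the candidate solution is structured internally."""
    cases = {
        "Hello World": "hello-world",
        "  Trim me  ": "trim-me",
        "Already-slugged": "already-slugged",
    }
    return all(slugify(src) == want for src, want in cases.items())


# Two correct candidate solutions with different implementations.
def slugify_split(text: str) -> str:
    return "-".join(text.lower().split())


def slugify_regex(text: str) -> str:
    return re.sub(r"\s+", "-", text.strip().lower())


# An incorrect candidate: it does not trim surrounding whitespace.
def slugify_broken(text: str) -> str:
    return text.lower().replace(" ", "-")


if __name__ == "__main__":
    for candidate in (slugify_split, slugify_regex, slugify_broken):
        print(candidate.__name__, verify_slugify(candidate))
```

A verifier that instead asserted on a specific helper name or file layout would wrongly fail one of the two correct candidates, which is the kind of misgrading the summary attributes to existing benchmarks.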
