
SWE-Bench Failures: When Coding Agents Spiral into 693 Lines of Hallucinations

  • #SWE-bench
  • #hallucination spirals
  • #AI coding agents
  • SWE-bench Bash Only tests coding models' ability to fix GitHub issues using only shell commands; even the top models fail on roughly 1 in 3 issues (a sketch of such a loop follows this list).
  • Gemini 2.5 Pro hallucinated entire classes and methods due to missing file context, leading to a catastrophic failure with 693 lines of incorrect code.
  • Claude Sonnet 4 initially made similar mistakes but recovered by recognizing errors and reinvestigating, eventually finding the correct fix.
  • GPT-5 avoided hallucinations by explicitly rechecking missing context and solved the problem correctly on the first attempt.
  • Key failure patterns include not recognizing missing information, skipping verification, and doubling down on bad assumptions.
  • The case study highlights the importance of models verifying their assumptions and recovering from errors to avoid hallucination spirals (see the verification sketch after this list).
  • Post-training improvements focus on teaching models to handle uncertainty, verify beliefs, and recover gracefully from mistakes.
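To make the benchmark setup concrete, here is a minimal sketch of what a "bash only" evaluation loop could look like. This is an illustrative assumption, not the benchmark's actual harness: the function names (`ask_model`, `run_in_sandbox`), the `DONE` stop signal, and the turn limit are all hypothetical.

```python
import subprocess

def run_in_sandbox(command: str, repo_dir: str, timeout: int = 60) -> str:
    """Run one shell command in the repo checkout and return its combined output."""
    result = subprocess.run(
        command, shell=True, cwd=repo_dir,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr

def solve_issue(ask_model, issue_text: str, repo_dir: str, max_turns: int = 50) -> str:
    """Bash-only agent loop: the model sees the issue and all prior command
    output, and may respond only with shell commands until it signals DONE."""
    transcript = f"Fix this issue using only shell commands:\n{issue_text}"
    for _ in range(max_turns):
        command = ask_model(transcript)  # the model's whole reply is one command
        if command.strip() == "DONE":
            break
        output = run_in_sandbox(command, repo_dir)
        transcript += f"\n$ {command}\n{output}"
    return transcript
```

The point of the restriction is that the model only "knows" what its own commands have printed, which is exactly why skipping a `cat` or `grep` step can leave it inventing file contents from scratch.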
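The verification and recovery habits the case study calls for can be pictured as two cheap guards in the agent loop. This is a minimal, hypothetical sketch under my own assumptions (the function names and the grep-based existence check are not anything the benchmark or the models actually run): confirm that every symbol a patch references exists before editing, and run the tests afterward instead of assuming success.

```python
import subprocess

def symbol_exists(repo_dir: str, symbol: str) -> bool:
    """Check whether a class or method name actually appears in the checkout.
    grep exits 0 on a match and 1 on no match, so the return code is the test."""
    result = subprocess.run(
        ["grep", "-rq", symbol, repo_dir],
        capture_output=True,
    )
    return result.returncode == 0

def missing_symbols(repo_dir: str, referenced_symbols: list[str]) -> list[str]:
    """Symbols a proposed patch references that do not exist in the repo.
    A nonempty result means the agent is about to edit code it never read
    and should go back and re-investigate instead of guessing."""
    return [s for s in referenced_symbols if not symbol_exists(repo_dir, s)]

def patch_verified(repo_dir: str, test_command: list[str]) -> bool:
    """Run the project's tests after applying a patch instead of trusting it."""
    return subprocess.run(test_command, cwd=repo_dir).returncode == 0
```

An agent loop could call `missing_symbols` before emitting a patch and `patch_verified` after, falling back to further investigation on either failure; that fallback is the recovery behavior the case study credits Claude Sonnet 4 with, and the step Gemini 2.5 Pro skipped on its way to 693 lines of hallucinated code.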