
SWE-Bench Failures: When Coding Agents Spiral into 693 Lines of Hallucinations

  • #SWE-bench
  • #hallucination spirals
  • #AI coding agents
  • SWE-bench Bash Only tests coding models' ability to fix GitHub issues using only shell commands; even the top models fail on roughly 1 in 3 issues (a sketch of such a loop follows this list).
  • Gemini 2.5 Pro hallucinated entire classes and methods due to missing file context, leading to a catastrophic failure with 693 lines of incorrect code.
  • Claude Sonnet 4 initially made similar mistakes but recovered by recognizing errors and reinvestigating, eventually finding the correct fix.
  • GPT-5 avoided hallucinations by explicitly rechecking missing context and solved the problem correctly on the first attempt.
  • Key failure patterns include not recognizing missing information, skipping verification, and doubling down on bad assumptions.
  • The case study highlights the importance of models verifying their assumptions and recovering from errors to avoid hallucination spirals (see the verification sketch after this list).
  • Post-training improvements focus on teaching models to handle uncertainty, verify beliefs, and recover gracefully from mistakes.
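To make the benchmark setup concrete, here is a minimal sketch of what a "bash only" evaluation loop could look like. This is an illustrative assumption, not the benchmark's actual harness: the function names (`ask_model`, `run_in_sandbox`), the `DONE` stop signal, and the turn limit are all hypothetical.

```python
import subprocess

def run_in_sandbox(command: str, repo_dir: str, timeout: int = 60) -> str:
    """Run one shell command in the repo checkout and return its combined output."""
    result = subprocess.run(
        command, shell=True, cwd=repo_dir,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr

def solve_issue(ask_model, issue_text: str, repo_dir: str, max_turns: int = 50) -> str:
    """Bash-only agent loop: the model sees the issue and all prior command
    output, and may respond only with shell commands until it signals DONE."""
    transcript = f"Fix this issue using only shell commands:\n{issue_text}"
    for _ in range(max_turns):
        command = ask_model(transcript)  # the model's whole reply is one command
        if command.strip() == "DONE":
            break
        output = run_in_sandbox(command, repo_dir)
        transcript += f"\n$ {command}\n{output}"
    return transcript
```

The point of the restriction is that the model only "knows" what its own commands have printed, which is exactly why skipping a `cat` or `grep` step can leave it inventing file contents from scratch.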
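The verification and recovery habits the case study calls for can be pictured as two cheap guards in the agent loop. This is a minimal, hypothetical sketch under my own assumptions (the function names and the grep-based existence check are not anything the benchmark or the models actually run): confirm that every symbol a patch references exists before editing, and run the tests afterward instead of assuming success.

```python
import subprocess

def symbol_exists(repo_dir: str, symbol: str) -> bool:
    """Check whether a class or method name actually appears in the checkout.
    grep exits 0 on a match and 1 on no match, so the return code is the test."""
    result = subprocess.run(
        ["grep", "-rq", symbol, repo_dir],
        capture_output=True,
    )
    return result.returncode == 0

def missing_symbols(repo_dir: str, referenced_symbols: list[str]) -> list[str]:
    """Symbols a proposed patch references that do not exist in the repo.
    A nonempty result means the agent is about to edit code it never read
    and should go back and re-investigate instead of guessing."""
    return [s for s in referenced_symbols if not symbol_exists(repo_dir, s)]

def patch_verified(repo_dir: str, test_command: list[str]) -> bool:
    """Run the project's tests after applying a patch instead of trusting it."""
    return subprocess.run(test_command, cwd=repo_dir).returncode == 0
```

An agent loop could call `missing_symbols` before emitting a patch and `patch_verified` after, falling back to further investigation on either failure; that fallback is the recovery behavior the case study credits Claude Sonnet 4 with, and the step Gemini 2.5 Pro skipped on its way to 693 lines of hallucinated code.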