Hasty Briefs

Many SWE-bench-Passing PRs would not be merged

2 days ago
  • #AI-coding
  • #maintainer-review
  • #benchmarking
  • Maintainer merge rates are about 24 percentage points lower than SWE-bench automated grader pass rates.
  • AI-generated PRs that pass automated grading are merged at roughly half the rate of human-written PRs.
  • The improvement rate for maintainer merge decisions is 9.6 percentage points per year (pp/yr) slower than for automated grading.
  • Primary rejection reasons: core functionality failure, breaks other code, and code quality issues.
  • Study limitations include using only a subset of the benchmark, no CI during review, and static patch comparison.
  • Results caution against naive extrapolation of benchmark scores to real-world usefulness.
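
The points above mix three kinds of comparison: an absolute gap in percentage points, a relative merge rate, and a difference in yearly improvement. A minimal Python sketch of the arithmetic follows. The function names and all input rates are hypothetical, chosen only so the outputs line up with the summary's headline figures; they are not data from the study.

```python
def pp_gap(auto_rate: float, maintainer_rate: float) -> float:
    """Absolute gap in percentage points between two rates given as fractions."""
    return round((auto_rate - maintainer_rate) * 100, 1)


def relative_rate(ai_merge_rate: float, human_merge_rate: float) -> float:
    """Ratio of one merge rate to another (0.5 means 'half the rate')."""
    return round(ai_merge_rate / human_merge_rate, 2)


def slope_gap(auto_pp_per_year: float, maintainer_pp_per_year: float) -> float:
    """How much slower (in pp/yr) the maintainer-based metric improves."""
    return round(auto_pp_per_year - maintainer_pp_per_year, 1)


# Hypothetical inputs, NOT figures reported by the study.
print(pp_gap(0.60, 0.36))         # absolute gap in percentage points
print(relative_rate(0.30, 0.60))  # relative merge rate
print(slope_gap(20.0, 10.4))      # difference in yearly improvement
```

A percentage-point gap and a relative rate answer different questions: a 24-point gap is an absolute difference, while "half the rate" is a ratio, so the two figures are not interchangeable.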