Many SWE-bench-Passing PRs would not be merged
- #AI-coding
- #maintainer-review
- #benchmarking
- Maintainer merge rates run about 24 percentage points below SWE-bench automated grader pass rates.
- AI-generated PRs that pass automated grading are merged at roughly half the rate of human-written PRs.
- Measured by maintainer decisions, scores improve 9.6 pp/yr more slowly than automated grading suggests.
- Primary rejection reasons: core functionality failure, breaks other code, and code quality issues.
- Study limitations: only a subset of the benchmark was evaluated, reviewers had no CI results, and comparisons used static patches rather than live repositories.
- Results caution against naive extrapolation of benchmark scores to real-world usefulness.
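The "percentage points per year" comparison above can be made concrete with a small sketch. The yearly figures below are hypothetical, chosen only to illustrate the computation; they are not the study's data.

```python
# Hypothetical yearly scores (percent of PRs passing / merged) to illustrate
# how a pp/yr improvement rate is computed. All numbers are made up.
years = [2022, 2023, 2024, 2025]
grader = [20.0, 35.0, 50.0, 65.0]   # automated grader pass rate (%)
merged = [5.0, 12.0, 19.0, 26.0]    # maintainer merge rate (%)

def slope_pp_per_year(xs, ys):
    """Ordinary least-squares slope, in percentage points per year."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

gap = grader[-1] - merged[-1]  # latest-year gap, in pp
slowdown = slope_pp_per_year(years, grader) - slope_pp_per_year(years, merged)
print(f"gap: {gap:.1f} pp, slope difference: {slowdown:.1f} pp/yr")
# → gap: 39.0 pp, slope difference: 8.0 pp/yr
```

The point of the sketch is that the two headline numbers measure different things: the gap is a snapshot difference in rates, while the slope difference says the maintainer-judged metric is also improving more slowly over time.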