Many SWE-bench-Passing PRs would not be merged
- #AI-coding
- #maintainer-review
- #benchmarking
- Maintainer merge rates run about 24 percentage points below SWE-bench automated grader pass rates.
- AI-generated PRs that pass automated grading are merged at roughly half the rate of human-written PRs.
- Measured by maintainer decisions, scores improve 9.6 pp/yr more slowly than automated grading suggests.
- Primary rejection reasons: core functionality failure, breaks other code, and code quality issues.
- Study limitations: only a subset of the benchmark was evaluated, reviewers had no CI results, and comparisons used static patches rather than live repositories.
- Results caution against naive extrapolation of benchmark scores to real-world usefulness.
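The "percentage points per year" comparison above can be made concrete with a small sketch. The yearly figures below are hypothetical, chosen only to illustrate the computation; they are not the study's data.

```python
# Hypothetical yearly scores (percent of PRs passing / merged) to illustrate
# how a pp/yr improvement rate is computed. All numbers are made up.
years = [2022, 2023, 2024, 2025]
grader = [20.0, 35.0, 50.0, 65.0]   # automated grader pass rate (%)
merged = [5.0, 12.0, 19.0, 26.0]    # maintainer merge rate (%)

def slope_pp_per_year(xs, ys):
    """Ordinary least-squares slope, in percentage points per year."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

gap = grader[-1] - merged[-1]  # latest-year gap, in pp
slowdown = slope_pp_per_year(years, grader) - slope_pp_per_year(years, merged)
print(f"gap: {gap:.1f} pp, slope difference: {slowdown:.1f} pp/yr")
# → gap: 39.0 pp, slope difference: 8.0 pp/yr
```

The point of the sketch is that the two headline numbers measure different things: the gap is a snapshot difference in rates, while the slope difference says the maintainer-judged metric is also improving more slowly over time.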