Are LLMs not getting better?
2 days ago
- #Performance
- #LLM
- #Programming
- LLM-generated code passes automated tests more often than it meets the quality bar for being merged.
- Measured performance drops significantly when success means maintainer approval rather than passing tests.
- Merge rates for LLM-generated code show no improvement since early 2025, contrary to some claims.
- Statistical analysis using the Brier score (lower is better) shows that a constant merge-rate model outperforms linear and logistic growth trends.
- Claims of recent capability improvements lack rigorous evidence, similar to unsubstantiated claims in 2025.
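
To make the Brier-score comparison concrete, here is a minimal sketch of how a constant merge-rate model can be scored against a growth trend. The data, the `linear_trend` coefficients, and all names below are hypothetical and chosen only for illustration; they are not taken from the original analysis, whose exact fitting and evaluation procedure is not described in this summary.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


# Hypothetical data: two submissions per month, 1 = merged, 0 = rejected.
months = [0, 0, 1, 1, 2, 2, 3, 3]
merged = [1, 0, 0, 1, 1, 0, 0, 1]

# Constant model: predict the overall merge rate for every submission.
constant_rate = sum(merged) / len(merged)
constant_preds = [constant_rate] * len(merged)


def linear_trend(month, intercept=0.2, slope=0.2):
    """Hypothetical 'steady improvement' model, clipped to [0, 1]."""
    return min(1.0, max(0.0, intercept + slope * month))


linear_preds = [linear_trend(m) for m in months]

constant_brier = brier_score(constant_preds, merged)
linear_brier = brier_score(linear_preds, merged)

print(f"constant model Brier score: {constant_brier:.3f}")
print(f"linear trend Brier score:   {linear_brier:.3f}")
```

On this toy data the constant model gets the lower (better) score because the merge rate does not actually rise over time. A real comparison would fit each model's parameters to the data and would preferably score predictions on held-out submissions rather than in-sample.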