Are LLMs not getting better?

2 days ago

LLM-generated code passes automated tests more often than it meets the quality bar maintainers require for merging.
Performance drops significantly when success is measured by maintainer approval rather than test passing.
Merge rates for LLM-generated code show no improvement since early 2025, contrary to some claims.
In a statistical comparison using Brier scores, a constant-merge-rate model outperforms both linear and logistic growth models.
Claims of recent capability improvements lack rigorous evidence, much like the unsubstantiated claims made in 2025.
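
To make the Brier-score comparison concrete, here is a minimal sketch of how a constant merge-rate model could be scored against a linear trend. Everything in it is an assumption for illustration: the outcome data is made up, the function names are hypothetical, and the leave-one-out evaluation and least-squares linear fit are one plausible setup, not necessarily the method used in the original analysis. A Brier score is the mean squared difference between predicted probabilities and the 0/1 outcomes, so lower is better.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not outcomes:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def fit_constant(times, outcomes):
    """Constant model: always predict the training-set merge rate."""
    rate = sum(outcomes) / len(outcomes)
    return lambda t: rate


def fit_linear(times, outcomes):
    """Least-squares linear trend in time, clipped to the range [0, 1]."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(outcomes) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        slope = 0.0
    else:
        cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, outcomes))
        slope = cov / var_t
    intercept = mean_y - slope * mean_t
    return lambda t: min(1.0, max(0.0, intercept + slope * t))


def leave_one_out_brier(fit, times, outcomes):
    """Score a model by predicting each point from a fit on all the others."""
    preds = []
    for i in range(len(times)):
        train_t = times[:i] + times[i + 1:]
        train_y = outcomes[:i] + outcomes[i + 1:]
        preds.append(fit(train_t, train_y)(times[i]))
    return brier_score(preds, outcomes)


# Made-up data: 1 = merged, 0 = not merged, ordered by (hypothetical) release date.
outcomes = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
times = list(range(len(outcomes)))

print("constant:", leave_one_out_brier(fit_constant, times, outcomes))
print("linear:  ", leave_one_out_brier(fit_linear, times, outcomes))
```

The held-out evaluation matters here: scored on the same data it was fitted to, a linear model can never do worse than a constant one by squared error, so an in-sample comparison would not be informative. A logistic trend could be compared the same way by adding another fitting function.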
