Hasty Briefs (beta)

Claude Code Daily Benchmarks for Degradation Tracking

9 days ago
  • #AI Performance
  • #Code Generation
  • #Benchmarking
  • Runs daily benchmarks on a curated subset of SWE-Bench-Pro to detect performance degradations in Claude Code running Opus 4.5.
  • Statistical testing with 95% confidence intervals is used to flag statistically significant degradations.
  • Benchmarks run directly in the Claude Code CLI, with no custom harness, so results reflect what users actually experience.
  • Each daily evaluation covers N=50 test instances; results are also aggregated weekly and monthly for more reliable estimates.
  • Monitoring is run by an independent third party with no affiliation to model providers, and was inspired by past postmortems of Claude degradations.
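
To make the statistical points above concrete, here is a minimal sketch of how a 95% confidence interval on a pass rate could be used to flag a degradation, and why pooling daily runs helps. The summary does not say which test the tracker actually uses, so the Wilson score interval, the function names, and the 58% baseline below are all assumptions chosen for illustration.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def is_degraded(passed: int, total: int, baseline_rate: float) -> bool:
    """Flag a degradation only when the whole interval sits below the baseline."""
    _, high = wilson_interval(passed, total)
    return high < baseline_rate


def pooled(runs: list[tuple[int, int]]) -> tuple[int, int]:
    """Pool (passed, total) pairs from several daily runs into one sample."""
    return sum(p for p, _ in runs), sum(n for _, n in runs)


# Hypothetical numbers: a 58% baseline and daily runs of 25 passes out of 50.
BASELINE = 0.58  # assumed reference pass rate, not taken from the source
daily = (25, 50)
weekly = pooled([daily] * 7)

print("daily interval:", wilson_interval(*daily), "degraded:", is_degraded(*daily, BASELINE))
print("weekly interval:", wilson_interval(*weekly), "degraded:", is_degraded(*weekly, BASELINE))
```

With these made-up numbers, a single day at 25/50 is not flagged because its interval is wide enough to include the baseline, while seven such days pooled together are flagged. That is the kind of reliability gain weekly and monthly aggregation is meant to provide at N=50 per day.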