Arena AI Model ELO History

AI labs frequently update models post-launch, potentially introducing nerfs like censorship, quantization, or performance degradation.
LM Arena (formerly LMSYS Arena) tests models through their API endpoints, so its scores reflect the raw model; web interfaces may behave differently because of added system prompts, filters, or quantized versions.
Data is sourced daily from the official LM Arena Leaderboard Dataset on Hugging Face, whose scores are based on human evaluations.
The chart shows each major AI lab's flagship lineage curve, tracking the highest-rated eligible model over time, not just the latest release.
Flagship models (e.g., Opus) remain on the curve even when mid-tier models (e.g., Sonnet) are released, and inference variants of the same model are collapsed into one entry to avoid spurious fluctuations.
New releases appear as labeled markers, often with score jumps, and degradation trends between releases are highlighted for visibility.
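
The lineage curve described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual implementation: the snapshot layout, the model names, the scores, and the function names `lineage_curve` and `degradation_segments` are all hypothetical, and it assumes one lab's eligible models have already been selected and their inference variants collapsed. A "new release" is approximated here as a change in which model holds the top score.

```python
# Hypothetical daily snapshots for one lab: date -> {model_name: arena_score}.
# Model names and scores are invented for illustration only.
SNAPSHOTS = {
    "2025-01-01": {"lab-a-large": 1300.0, "lab-a-medium": 1280.0},
    "2025-01-02": {"lab-a-large": 1296.0, "lab-a-medium": 1281.0},
    "2025-01-03": {"lab-a-large": 1293.0, "lab-a-large-2": 1340.0},
    "2025-01-04": {"lab-a-large": 1292.0, "lab-a-large-2": 1338.0},
}


def lineage_curve(snapshots):
    """Return [(date, top_model, score, is_new_top)], one entry per day.

    The curve follows the highest-rated model on each day, so an older
    flagship stays on the curve until something actually outscores it.
    """
    curve = []
    previous_top = None
    for date in sorted(snapshots):  # ISO dates sort chronologically
        scores = snapshots[date]
        if not scores:
            continue  # skip days with no data
        top_model = max(scores, key=scores.get)
        is_new_top = previous_top is not None and top_model != previous_top
        curve.append((date, top_model, scores[top_model], is_new_top))
        previous_top = top_model
    return curve


def degradation_segments(curve):
    """Net score change over each stretch where one model stays on top.

    A negative delta is the kind of between-release drift the chart
    highlights; a positive delta means the score rose instead.
    """
    segments = []
    start = 0
    for i in range(1, len(curve) + 1):
        if i == len(curve) or curve[i][1] != curve[start][1]:
            model = curve[start][1]
            delta = curve[i - 1][2] - curve[start][2]
            segments.append((model, curve[start][0], curve[i - 1][0], delta))
            start = i
    return segments


if __name__ == "__main__":
    curve = lineage_curve(SNAPSHOTS)
    for row in curve:
        print(row)
    for segment in degradation_segments(curve):
        print(segment)
```

Taking the daily maximum rather than the most recent release is what keeps a mid-tier launch from pulling the curve down, and measuring the change within each same-model stretch separates gradual drift from the jump at a release.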
