Meta got caught gaming LMArena
a year ago
- #Meta
- #AI
- #Benchmarks
- Meta released two new Llama 4 models: Scout (smaller) and Maverick (mid-size).
- Maverick claimed to outperform GPT-4o and Gemini 2.0 Flash in benchmarks.
- Maverick debuted in second place on LMArena with an Elo score of 1417.
- The score came from an 'experimental chat version' of Maverick tuned for LMArena-style conversations, not the publicly released model.
- LMArena criticized Meta for failing to make the customization clear, and updated its policies in response.
- Meta defended the move, saying it routinely experiments with custom variants.
- Concerns arose about Meta potentially training models to perform better on benchmarks.
- Meta denied training on test sets, attributing the variable performance to third-party implementations still being tuned.
- Llama 4's release was delayed due to internal performance concerns.
- Benchmarks may not reflect real-world model performance, misleading developers.
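For context on the 1417 figure above: LMArena ranks models from pairwise human votes using an Elo-style rating. The sketch below is a simplified classic Elo update, not LMArena's actual computation (which uses a Bradley-Terry model over all votes); it just shows how head-to-head wins move a score.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two Elo ratings after one pairwise comparison.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k: maximum rating change per comparison (illustrative value).
    """
    # Expected win probability for A given the rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two models start equal; one wins a single head-to-head vote.
a, b = elo_update(1400, 1400, 1.0)
print(round(a), round(b))  # 1416 1384
```

A gap like Maverick's over lower-ranked models emerges only after many such votes, which is why a variant tuned to win these pairwise matchups can inflate its rank without being a better general-purpose model.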