Meta got caught gaming LMArena
a year ago
- #Meta
- #AI
- #Benchmarks
- Meta released two new Llama 4 models: Scout (smaller) and Maverick (mid-size).
- Maverick claimed to outperform GPT-4o and Gemini 2.0 Flash in benchmarks.
- Maverick debuted in second place on LMArena with an Elo score of 1417.
- The score came from an 'experimental chat version' of Maverick tuned for LMArena-style conversations, not the publicly released model.
- LMArena criticized Meta for failing to make the customization clear, and updated its policies in response.
- Meta defended the move, saying it routinely experiments with custom variants.
- Concerns arose about Meta potentially training models to perform better on benchmarks.
- Meta denied training on test sets, attributing the variable performance to third-party implementations still being tuned.
- Llama 4's release was delayed due to internal performance concerns.
- Benchmarks may not reflect real-world model performance, misleading developers.
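For context on the 1417 figure above: LMArena ranks models from pairwise human votes using an Elo-style rating. The sketch below is a simplified classic Elo update, not LMArena's actual computation (which uses a Bradley-Terry model over all votes); it just shows how head-to-head wins move a score.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two Elo ratings after one pairwise comparison.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k: maximum rating change per comparison (illustrative value).
    """
    # Expected win probability for A given the rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two models start equal; one wins a single head-to-head vote.
a, b = elo_update(1400, 1400, 1.0)
print(round(a), round(b))  # 1416 1384
```

A gap like Maverick's over lower-ranked models emerges only after many such votes, which is why a variant tuned to win these pairwise matchups can inflate its rank without being a better general-purpose model.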