Hasty Briefs

Meta got caught gaming LMArena

a year ago
Tags: #Meta, #AI, #Benchmarks

  • Meta released two new Llama 4 models: Scout (smaller) and Maverick (mid-size).
  • Meta claimed that Maverick outperforms GPT-4o and Gemini 2.0 Flash on benchmarks.
  • Maverick ranked second on LMArena with an Elo score of 1417.
  • Meta used an 'experimental chat version' of Maverick optimized for LMArena, not the public version.
  • LMArena criticized Meta for not making clear that the model was customized, and updated its policies in response.
  • Meta defended the move, stating that it experiments with custom variants.
  • Concerns arose about Meta potentially training models to perform better on benchmarks.
  • Meta denied training on test sets, attributing performance variability to implementation issues.
  • Llama 4's release was delayed due to internal performance concerns.
  • Benchmarks may not reflect real-world model performance, which can mislead developers.