Hasty Briefs (beta)

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

3 days ago
  • #LLMs
  • #USAMO
  • #Mathematical Reasoning
  • Current benchmarks for large language models (LLMs), such as MathArena, focus on final numerical answers and neglect rigorous reasoning and proof generation.
  • A new evaluation tested state-of-the-art reasoning models on the 2025 USAMO problems and found that they struggled significantly, averaging less than 5% of the available points.
  • Detailed analysis of the models' reasoning traces identified common failure modes and unwanted artifacts introduced by model training strategies.
  • The study concludes that current LLMs are inadequate for rigorous mathematical proof writing, underscoring the need for substantial improvements in reasoning and proof generation.