Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
- #LLMs
- #USAMO
- #Mathematical Reasoning
- Current benchmarks for large language models (LLMs), such as MathArena, focus on final numerical answers but neglect rigorous reasoning and proof generation.
- A new evaluation tested state-of-the-art reasoning models on the 2025 USAMO problems and found that they struggled significantly, averaging less than 5% of the available score.
- Detailed analysis of the models' reasoning traces identified common failure modes, as well as unwanted artifacts introduced by model training strategies.
- The study concludes that current LLMs are inadequate for rigorous mathematical reasoning, highlighting the need for substantial improvements in reasoning and proof generation.