Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
- #LLMs
- #USAMO
- #Mathematical Reasoning
- Current benchmarks for large language models (LLMs), such as MathArena, focus on final numerical answers but neglect rigorous reasoning and proof generation.
- A new evaluation tested state-of-the-art reasoning models on the 2025 USAMO problems and found that they struggled significantly, averaging less than 5% of the available score.
- Detailed analysis of the models' reasoning traces identified common failure modes, as well as unwanted artifacts introduced by model training strategies.
- The study concludes that current LLMs are inadequate for rigorous mathematical reasoning, highlighting the need for substantial improvements in reasoning and proof generation.