- Current benchmarks for large language models (LLMs), such as MathArena, focus on final numerical answers but neglect rigorous reasoning and proof generation.
- A new evaluation tested state-of-the-art reasoning models on the 2025 USAMO problems, revealing that they struggle significantly with full proofs: average performance was below 5% of the maximum score.
- Detailed analysis of the models' reasoning traces identified common failure modes, as well as unwanted artifacts introduced by model training strategies.
- The study concludes that current LLMs are inadequate for rigorous mathematical reasoning, highlighting the need for substantial improvements in reasoning and proof generation.