Benchmarks in Leipzig
- #research dataset
- #mathematics benchmarks
- #LLM evaluation
- Between April 1 and May 15, 2026, a group of 49 mathematicians created a dataset of 100 research-level mathematics questions with known answers.
- Most of the work took place during a 3-day workshop with 35 participants at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany.
- The questions were evaluated in three stages using state-of-the-art LLMs, with the number of unsolved questions dropping from 41 after Stage 1 to 16 after Stage 2, and finally to only 2 after Stage 3.
- The results indicate that state-of-the-art large language models (LLMs) now have strong mathematical reasoning capabilities: only 2 of the 100 research-level questions remained unsolved after Stage 3.
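
The unsolved counts reported above can be turned into solved counts and solve rates with a little arithmetic. The sketch below is only illustrative: it assumes that all 100 questions were evaluated at every stage and that the unsolved counts are cumulative, neither of which the summary states explicitly. The names `TOTAL_QUESTIONS`, `unsolved_after_stage`, and `solved_summary` are hypothetical and chosen for this example.

```python
# Figures taken from the summary: 100 questions in total, and the number
# still unsolved after each evaluation stage.
TOTAL_QUESTIONS = 100
unsolved_after_stage = {1: 41, 2: 16, 3: 2}


def solved_summary(total, unsolved_by_stage):
    """Return (stage, solved, solved_fraction) tuples in stage order.

    Assumption (not stated in the source): every stage was run against
    all `total` questions, so solved = total - unsolved.
    """
    rows = []
    for stage in sorted(unsolved_by_stage):
        solved = total - unsolved_by_stage[stage]
        rows.append((stage, solved, solved / total))
    return rows


summary = solved_summary(TOTAL_QUESTIONS, unsolved_after_stage)
for stage, solved, fraction in summary:
    print(f"Stage {stage}: {solved} solved ({fraction:.0%})")
```

Under these assumptions, the solved counts are 59, 84, and 98 after Stages 1, 2, and 3 respectively.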