Benchmarks in Leipzig

3 hours ago
  • #research dataset
  • #mathematics benchmarks
  • #LLM evaluation
  • Between April 1 and May 15, 2026, a group of 49 mathematicians created a dataset of 100 research-level mathematics questions with known answers.
  • Most of the work took place during a 3-day workshop with 35 participants at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany.
  • The questions were evaluated in three stages using state-of-the-art LLMs, with the number of unsolved questions dropping from 41 after Stage 1 to 16 after Stage 2, and finally to only 2 after Stage 3.
  • The results suggest that the mathematical reasoning of state-of-the-art large language models (LLMs) now reaches research-level questions: only 2 of the 100 questions remained unsolved after Stage 3.
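
The stage counts above can be turned into solve rates with a few lines of Python. This is only an illustrative sketch: it assumes the unsolved counts (41, 16, 2) are all measured against the full set of 100 questions, and the names `TOTAL_QUESTIONS`, `unsolved_after_stage`, and `solve_rates` are made up for this example rather than taken from the benchmark itself.

```python
# Figures quoted in the summary; the assumption is that every stage
# is scored against the same 100 questions.
TOTAL_QUESTIONS = 100
unsolved_after_stage = {1: 41, 2: 16, 3: 2}


def solve_rates(total, unsolved):
    """Return the fraction of questions solved after each stage."""
    return {stage: (total - count) / total for stage, count in unsolved.items()}


rates = solve_rates(TOTAL_QUESTIONS, unsolved_after_stage)

for stage, rate in rates.items():
    # Prints one line per stage, e.g. "Stage 1: 59% solved".
    print(f"Stage {stage}: {rate:.0%} solved")
```

Under that assumption, the solved share rises from 59% after Stage 1 to 84% after Stage 2 and 98% after Stage 3.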