The Illusion of the Illusion of Thinking – A Comment on Shojaee et al. (2025)
- #Experimental Design
- #Reasoning Models
- #Artificial Intelligence
- Shojaee et al. (2025) report 'accuracy collapse' in Large Reasoning Models (LRMs) on complex planning puzzles.
- The comment identifies three experimental-design limitations that undermine the reported findings:
- 1. Tower of Hanoi experiments require outputs that exceed model token limits at large N (a full solution lists 2^N − 1 moves), and models explicitly acknowledge this constraint in their responses.
- 2. The automated evaluation cannot distinguish genuine reasoning failures from practical output constraints, scoring truncated answers as reasoning errors.
- 3. River Crossing benchmarks include mathematically unsolvable instances (N ≥ 6 actor/agent pairs with boat capacity 3), yet models are scored as failures for not solving them.
- When these artifacts are controlled for, e.g. by asking for a compact solution-generating function instead of an exhaustive move list, models show high accuracy on Tower of Hanoi instances previously reported as failures.
- Highlights the importance of careful experimental design in evaluating AI reasoning capabilities.
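The token-limit point (1) can be made concrete with a minimal sketch (not the paper's code): the full move list for N disks has 2^N − 1 entries and so grows exponentially, while the program that generates it stays a few lines long — which is why asking for a generating function sidesteps the output limit.

```python
def hanoi_moves(n, src="A", aux="B", dst="C", moves=None):
    """Recursively enumerate every move of an n-disk Tower of Hanoi solution."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_moves(n - 1, src, dst, aux, moves)  # clear the top n-1 disks to aux
    moves.append((src, dst))                  # move the largest disk
    hanoi_moves(n - 1, aux, src, dst, moves)  # restack the n-1 disks onto it
    return moves

for n in (5, 10, 15):
    # The move list doubles with each extra disk: 2**n - 1 entries.
    print(n, len(hanoi_moves(n)))
```

Printing every move for N = 15 already takes 32,767 lines, while the solver above is constant-size regardless of N.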
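For the unsolvability point (3), a brute-force search makes it checkable: the sketch below (my illustration, not the paper's harness) runs BFS over jealous-husbands-style River Crossing states, where no wife may share a bank or boat with another husband unless her own is present. The paper's case (N ≥ 6 pairs, boat capacity 3) has too large a state space for a quick demo, so the test uses the analogous small fact that capacity-2 instances become unsolvable at N = 4.

```python
from collections import deque
from itertools import combinations

def solvable(n_pairs, boat_cap):
    """BFS over river-crossing states; True iff all couples can cross."""
    people = frozenset((role, i) for i in range(n_pairs) for role in "HW")

    def ok(group):
        # A wife may not be with another husband unless her own is present.
        return all(
            ("H", i) in group
            for (role, i) in group if role == "W"
            if any(r == "H" and j != i for (r, j) in group)
        )

    start = (people, 0)            # (people on start bank, boat side: 0=start)
    goal = (frozenset(), 1)        # everyone across, boat on the far side
    seen = {start}
    queue = deque([start])
    while queue:
        bank, side = queue.popleft()
        if (bank, side) == goal:
            return True
        here = bank if side == 0 else people - bank
        for size in range(1, boat_cap + 1):
            for crew in combinations(sorted(here), size):
                crew = frozenset(crew)
                new_bank = bank - crew if side == 0 else bank | crew
                # Constraint must hold in the boat and on both banks.
                if ok(crew) and ok(new_bank) and ok(people - new_bank):
                    state = (new_bank, 1 - side)
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False                   # search space exhausted: no solution
```

An exhaustive search returning `False` is exactly the situation the comment flags: an evaluator that scores such instances as model failures is measuring the benchmark, not the model.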