The Illusion of the Illusion of Thinking – A Comment on Shojaee et al. (2025)
- #Experimental Design
- #Reasoning Models
- #Artificial Intelligence
- Shojaee et al. (2025) report 'accuracy collapse' in Large Reasoning Models (LRMs) on complex planning puzzles.
- The comment identifies three experimental-design limitations that undermine the reported findings:
- 1. Tower of Hanoi experiments require outputs that exceed model token limits at large N (a full solution lists 2^N − 1 moves), and models explicitly acknowledge this constraint in their responses.
- 2. The automated evaluation cannot distinguish genuine reasoning failures from practical output constraints, scoring truncated answers as reasoning errors.
- 3. River Crossing benchmarks include mathematically unsolvable instances (N ≥ 6 actor/agent pairs with boat capacity 3), yet models are scored as failures for not solving them.
- When these artifacts are controlled for, e.g. by asking for a compact solution-generating function instead of an exhaustive move list, models show high accuracy on Tower of Hanoi instances previously reported as failures.
- Highlights the importance of careful experimental design in evaluating AI reasoning capabilities.
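The token-limit point (1) can be made concrete with a minimal sketch (not the paper's code): the full move list for N disks has 2^N − 1 entries and so grows exponentially, while the program that generates it stays a few lines long — which is why asking for a generating function sidesteps the output limit.

```python
def hanoi_moves(n, src="A", aux="B", dst="C", moves=None):
    """Recursively enumerate every move of an n-disk Tower of Hanoi solution."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_moves(n - 1, src, dst, aux, moves)  # clear the top n-1 disks to aux
    moves.append((src, dst))                  # move the largest disk
    hanoi_moves(n - 1, aux, src, dst, moves)  # restack the n-1 disks onto it
    return moves

for n in (5, 10, 15):
    # The move list doubles with each extra disk: 2**n - 1 entries.
    print(n, len(hanoi_moves(n)))
```

Printing every move for N = 15 already takes 32,767 lines, while the solver above is constant-size regardless of N.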
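For the unsolvability point (3), a brute-force search makes it checkable: the sketch below (my illustration, not the paper's harness) runs BFS over jealous-husbands-style River Crossing states, where no wife may share a bank or boat with another husband unless her own is present. The paper's case (N ≥ 6 pairs, boat capacity 3) has too large a state space for a quick demo, so the test uses the analogous small fact that capacity-2 instances become unsolvable at N = 4.

```python
from collections import deque
from itertools import combinations

def solvable(n_pairs, boat_cap):
    """BFS over river-crossing states; True iff all couples can cross."""
    people = frozenset((role, i) for i in range(n_pairs) for role in "HW")

    def ok(group):
        # A wife may not be with another husband unless her own is present.
        return all(
            ("H", i) in group
            for (role, i) in group if role == "W"
            if any(r == "H" and j != i for (r, j) in group)
        )

    start = (people, 0)            # (people on start bank, boat side: 0=start)
    goal = (frozenset(), 1)        # everyone across, boat on the far side
    seen = {start}
    queue = deque([start])
    while queue:
        bank, side = queue.popleft()
        if (bank, side) == goal:
            return True
        here = bank if side == 0 else people - bank
        for size in range(1, boat_cap + 1):
            for crew in combinations(sorted(here), size):
                crew = frozenset(crew)
                new_bank = bank - crew if side == 0 else bank | crew
                # Constraint must hold in the boat and on both banks.
                if ok(crew) and ok(new_bank) and ok(people - new_bank):
                    state = (new_bank, 1 - side)
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False                   # search space exhausted: no solution
```

An exhaustive search returning `False` is exactly the situation the comment flags: an evaluator that scores such instances as model failures is measuring the benchmark, not the model.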