Reasoning Models Reason Well, Until They Don't
6 months ago
- #Reasoning
- #Large Language Models
- #Artificial Intelligence
- Large language models (LLMs) have made progress on reasoning tasks, but their performance breaks down as problem complexity increases.
- Large reasoning models (LRMs) are fine-tuned for step-by-step reasoning and self-verification.
- LRMs perform well on benchmarks like NLGraph but struggle with more complex problems.
- A new dataset, Deep Reasoning Dataset (DeepRD), is introduced to evaluate scalable complexity.
- Once complexity exceeds a threshold, LRM performance drops abruptly and does not generalize to larger problem instances.
- The complexity of most real-world knowledge graphs falls within the regime where LRMs succeed, but the long tail of harder instances exposes their failure modes.
- The study acknowledges LRMs' current utility but calls for new methods that scale to higher complexity.
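The core idea behind a scalable-complexity benchmark like DeepRD can be illustrated with a toy generator: build reasoning problems whose difficulty is governed by a single knob (here, the number of hops between a source and target node in a graph-connectivity question). This is a minimal sketch under assumed conventions, not the actual DeepRD construction; the function name, prompt format, and distractor scheme are all hypothetical.

```python
import random

def make_chain_task(path_len, n_distractors=5, seed=0):
    """Generate a toy graph-connectivity question whose difficulty is
    controlled by path_len (hops from source to target).
    Illustrative only -- not the real DeepRD generator."""
    rng = random.Random(seed)
    nodes = [f"n{i}" for i in range(path_len + 1 + n_distractors)]
    rng.shuffle(nodes)
    chain = nodes[: path_len + 1]          # the true source-to-target path
    edges = list(zip(chain, chain[1:]))    # edges along that path
    # Distractor edges among the leftover nodes, disconnected from the chain.
    rest = nodes[path_len + 1 :]
    for _ in range(n_distractors - 1):
        a, b = rng.sample(rest, 2)
        edges.append((a, b))
    rng.shuffle(edges)  # hide the path order from the prompt
    prompt = ("Edges: " + "; ".join(f"{a}->{b}" for a, b in edges)
              + f". Is there a path from {chain[0]} to {chain[-1]}?")
    return prompt, "yes"
```

Sweeping `path_len` upward while holding the prompt format fixed is what lets such a benchmark probe where a model's accuracy collapses, rather than reporting a single aggregate score.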