Reasoning Models Reason Well, Until They Don't
6 months ago
- #Reasoning
- #Large Language Models
- #Artificial Intelligence
- Large language models (LLMs) have made progress on reasoning tasks, but their performance breaks down as problem complexity increases.
- Large reasoning models (LRMs) are fine-tuned for step-by-step reasoning and self-verification.
- LRMs perform well on benchmarks like NLGraph but struggle with more complex problems.
- A new dataset, Deep Reasoning Dataset (DeepRD), is introduced to evaluate scalable complexity.
- Once complexity exceeds a threshold, LRM performance drops abruptly and does not generalize to larger problem instances.
- The complexity of most real-world knowledge graphs falls within the regime where LRMs succeed, but the long tail of harder instances exposes their failure modes.
- The study acknowledges LRMs' current utility but calls for new methods that scale to higher complexity.
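The core idea behind a scalable-complexity benchmark like DeepRD can be illustrated with a toy generator: build reasoning problems whose difficulty is governed by a single knob (here, the number of hops between a source and target node in a graph-connectivity question). This is a minimal sketch under assumed conventions, not the actual DeepRD construction; the function name, prompt format, and distractor scheme are all hypothetical.

```python
import random

def make_chain_task(path_len, n_distractors=5, seed=0):
    """Generate a toy graph-connectivity question whose difficulty is
    controlled by path_len (hops from source to target).
    Illustrative only -- not the real DeepRD generator."""
    rng = random.Random(seed)
    nodes = [f"n{i}" for i in range(path_len + 1 + n_distractors)]
    rng.shuffle(nodes)
    chain = nodes[: path_len + 1]          # the true source-to-target path
    edges = list(zip(chain, chain[1:]))    # edges along that path
    # Distractor edges among the leftover nodes, disconnected from the chain.
    rest = nodes[path_len + 1 :]
    for _ in range(n_distractors - 1):
        a, b = rng.sample(rest, 2)
        edges.append((a, b))
    rng.shuffle(edges)  # hide the path order from the prompt
    prompt = ("Edges: " + "; ".join(f"{a}->{b}" for a, b in edges)
              + f". Is there a path from {chain[0]} to {chain[-1]}?")
    return prompt, "yes"
```

Sweeping `path_len` upward while holding the prompt format fixed is what lets such a benchmark probe where a model's accuracy collapses, rather than reporting a single aggregate score.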