Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark
- #AI Efficiency
- #Token Optimization
- #Reasoning Models
- Large Reasoning Models (LRMs) use test-time scaling and reinforcement learning to enhance problem-solving with extended chains of thought (CoT).
- Token efficiency, the number of tokens a model spends on reasoning relative to the length of its final solution, is a critical but often overlooked factor in model performance (see the sketch after this list).
- Closed models (e.g., OpenAI, Grok-4) optimize for fewer tokens to reduce costs, while open models (e.g., DeepSeek, Qwen) use more tokens, potentially for better reasoning.
- Open weight models use 1.5–4× more tokens than closed ones, with up to 10× excess for simple knowledge questions.
- Token efficiency affects inference cost, latency, and context-window usage, making it a key metric for practical deployment (see the cost sketch after this list).
- The study systematically evaluates token efficiency across knowledge questions, math problems, and logic puzzles.
- Closed models lead in token efficiency for math, while open models like llama-3.3-nemotron-super-49b-v1 show competitive efficiency.
- OpenAI's gpt-oss models set a new standard for token efficiency in open weight models, with extremely short CoT.
- The efficiency gap varies by domain: it is most pronounced for knowledge questions (about 3×) and narrower for math problems (about 2×) and logic puzzles.
- Closed models iteratively optimize token usage, while open models prioritize reasoning performance over efficiency.
- The release of gpt-oss models provides a reference for optimizing token usage in other open weight models.
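
To make the efficiency ratio from the second bullet concrete, here is a minimal Python sketch of reasoning tokens spent per token of the final solution. The function name and the example token counts are illustrative assumptions, not code or data from the study.

```python
# Minimal sketch of the ratio described above: chain-of-thought (reasoning)
# tokens spent per token of the final solution. Lower is more token-efficient.
# The example counts below are illustrative assumptions, not data from the study.

def token_efficiency(reasoning_tokens: int, answer_tokens: int) -> float:
    """Reasoning tokens spent per answer token (lower = more efficient)."""
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return reasoning_tokens / answer_tokens

# Hypothetical comparison on the same knowledge question (~40-token answer):
closed = token_efficiency(reasoning_tokens=300, answer_tokens=40)
open_weight = token_efficiency(reasoning_tokens=3_000, answer_tokens=40)
print(f"closed: {closed:.1f} tokens per answer token, "
      f"open-weight: {open_weight:.1f}, excess: {open_weight / closed:.1f}x")
```
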
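Since the cost and latency point is ultimately linear arithmetic in output tokens, the back-of-the-envelope sketch below shows how a verbose chain of thought translates into dollars and decode time. The per-token price and decode throughput are placeholder assumptions, not figures from the article.

```python
# Rough deployment impact of verbose chains of thought: both cost and decode
# latency scale linearly with output tokens. Price and throughput below are
# placeholder assumptions, not measurements from the article.

def completion_cost_usd(output_tokens: int, usd_per_million_output: float) -> float:
    return output_tokens / 1_000_000 * usd_per_million_output

def decode_latency_s(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second

for label, tokens in [("efficient CoT", 500), ("verbose CoT", 2_000)]:
    cost = completion_cost_usd(tokens, usd_per_million_output=10.0)
    latency = decode_latency_s(tokens, tokens_per_second=50.0)
    print(f"{label}: {tokens} tokens -> ${cost:.3f}, ~{latency:.0f}s to decode")
```

A 4× difference in output tokens therefore means roughly 4× the output cost and 4× the decode time per query, before any effect on context-window headroom.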