Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark
- #AI Efficiency
- #Token Optimization
- #Reasoning Models
- Large Reasoning Models (LRMs) use test-time scaling and reinforcement learning to enhance problem-solving with extended chains of thought (CoT).
- Token efficiency, the number of tokens a model spends on reasoning relative to the length of its final solution, is a critical but often overlooked factor in model performance (see the sketch after this list).
- Closed models (e.g., OpenAI, Grok-4) optimize for fewer tokens to reduce costs, while open models (e.g., DeepSeek, Qwen) use more tokens, potentially for better reasoning.
- Open weight models use 1.5–4× more tokens than closed ones, with up to 10× excess for simple knowledge questions.
- Token efficiency affects inference cost, latency, and context-window usage, making it a key metric for practical deployment (see the cost sketch after this list).
- The study systematically evaluates token efficiency across knowledge questions, math problems, and logic puzzles.
- Closed models lead in token efficiency for math, while open models like llama-3.3-nemotron-super-49b-v1 show competitive efficiency.
- OpenAI's gpt-oss models set a new standard for token efficiency in open weight models, with extremely short CoT.
- The efficiency gap varies by domain: it is most pronounced for knowledge questions (about 3×) and narrower for math problems (about 2×) and logic puzzles.
- Closed models iteratively optimize token usage, while open models prioritize reasoning performance over efficiency.
- The release of gpt-oss models provides a reference for optimizing token usage in other open weight models.
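
To make the efficiency ratio from the second bullet concrete, here is a minimal Python sketch of reasoning tokens spent per token of the final solution. The function name and the example token counts are illustrative assumptions, not code or data from the study.

```python
# Minimal sketch of the ratio described above: chain-of-thought (reasoning)
# tokens spent per token of the final solution. Lower is more token-efficient.
# The example counts below are illustrative assumptions, not data from the study.

def token_efficiency(reasoning_tokens: int, answer_tokens: int) -> float:
    """Reasoning tokens spent per answer token (lower = more efficient)."""
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return reasoning_tokens / answer_tokens

# Hypothetical comparison on the same knowledge question (~40-token answer):
closed = token_efficiency(reasoning_tokens=300, answer_tokens=40)
open_weight = token_efficiency(reasoning_tokens=3_000, answer_tokens=40)
print(f"closed: {closed:.1f} tokens per answer token, "
      f"open-weight: {open_weight:.1f}, excess: {open_weight / closed:.1f}x")
```
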
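Since the cost and latency point is ultimately linear arithmetic in output tokens, the back-of-the-envelope sketch below shows how a verbose chain of thought translates into dollars and decode time. The per-token price and decode throughput are placeholder assumptions, not figures from the article.

```python
# Rough deployment impact of verbose chains of thought: both cost and decode
# latency scale linearly with output tokens. Price and throughput below are
# placeholder assumptions, not measurements from the article.

def completion_cost_usd(output_tokens: int, usd_per_million_output: float) -> float:
    return output_tokens / 1_000_000 * usd_per_million_output

def decode_latency_s(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second

for label, tokens in [("efficient CoT", 500), ("verbose CoT", 2_000)]:
    cost = completion_cost_usd(tokens, usd_per_million_output=10.0)
    latency = decode_latency_s(tokens, tokens_per_second=50.0)
    print(f"{label}: {tokens} tokens -> ${cost:.3f}, ~{latency:.0f}s to decode")
```

A 4× difference in output tokens therefore means roughly 4× the output cost and 4× the decode time per query, before any effect on context-window headroom.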