
Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark

9 days ago
  • #AI Efficiency
  • #Token Optimization
  • #Reasoning Models
  • Large Reasoning Models (LRMs) use test-time scaling and reinforcement learning to enhance problem-solving with extended chains of thought (CoT).
  • Token efficiency, the ratio of tokens spent on chain-of-thought reasoning to tokens in the final solution, is a critical but often overlooked factor in model performance (see the sketch after this list).
  • Closed models (e.g., OpenAI's models, Grok-4) optimize for fewer tokens to cut costs, while open-weight models (e.g., DeepSeek, Qwen) use more tokens, potentially to support deeper reasoning.
  • Open-weight models use 1.5–4× more tokens than closed ones, with up to 10× excess on simple knowledge questions.
  • Token efficiency impacts costs, latency, and context window usage, making it a key metric for practical deployment.
  • The study systematically evaluates token efficiency across knowledge questions, math problems, and logic puzzles.
  • Closed models lead in token efficiency on math, though some open-weight models, such as llama-3.3-nemotron-super-49b-v1, come close.
  • OpenAI's gpt-oss models set a new standard for token efficiency among open-weight models, with exceptionally short chains of thought.
  • The efficiency gap varies by domain: it is widest for knowledge questions (3×) and narrower for math problems (2×) and logic puzzles.
  • Closed models iteratively optimize token usage, while open models prioritize reasoning performance over efficiency.
  • The release of the gpt-oss models provides a reference point for optimizing token usage in other open-weight models.
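
A minimal sketch of the reasoning-to-solution ratio described above, and of how it translates into cost per question. The model names, token counts, and per-token price are illustrative assumptions, not figures or code from the study.

```python
# Sketch of the reasoning-to-solution token ratio and its cost impact.
# Model names, token counts, and the per-token price are hypothetical
# placeholders, not values taken from the study.

PRICE_PER_MILLION_TOKENS = 2.00  # assumed output-token price in USD


def token_efficiency(reasoning_tokens: int, solution_tokens: int) -> float:
    """Ratio of chain-of-thought tokens to final-solution tokens (lower is better)."""
    if solution_tokens <= 0:
        raise ValueError("solution must contain at least one token")
    return reasoning_tokens / solution_tokens


# Hypothetical completions for the same prompt from two models.
runs = {
    "closed-model-a":      {"reasoning": 400,  "solution": 150},
    "open-weight-model-b": {"reasoning": 1600, "solution": 160},
}

for name, r in runs.items():
    ratio = token_efficiency(r["reasoning"], r["solution"])
    total = r["reasoning"] + r["solution"]
    cost = total * PRICE_PER_MILLION_TOKENS / 1_000_000
    print(f"{name}: {ratio:.1f}x reasoning-to-solution, "
          f"{total} output tokens, ~${cost:.4f} per question")
```

Averaged over a benchmark, the same ratio (or simply total completion tokens per correct answer) is what makes the 1.5–4× gap between closed and open-weight models directly visible in billing, latency, and context-window pressure.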