Even GPT-5.2 Can't Count to Five: Zero-Error Horizons in Trustworthy LLMs
- #AI Safety
- #LLM Evaluation
- #Algorithmic Capabilities
- Introduces Zero-Error Horizon (ZEH) as a metric for trustworthy LLMs, defined as the maximum input length (problem size) up to which an LLM solves a task with zero errors.
- Evaluates the ZEH of state-of-the-art LLMs such as GPT-5.2, revealing surprising failures on simple tasks (e.g., computing the parity of '11000' or checking balanced parentheses).
- Highlights that ZEH offers insight into the emergence of algorithmic capabilities and, despite some correlation, measures something distinct from accuracy.
- Applies ZEH to Qwen2.5 for a detailed analysis, showing it offers clues about model capabilities that accuracy metrics miss.
- Addresses the computational cost of ZEH evaluation, proposing mitigation via tree structures and online softmax for up to a 10x speedup.