Better Models: Worse Tools
5 hours ago
- #Model regression
- #Schema compliance
- #LLM tool calling
- Newer Claude models (Opus 4.8, Sonnet 5) sometimes call Pi's edit tool with extra, invented fields in the edits array. These calls violate the tool schema and are rejected; older models did not make this mistake.
- The likely cause is post-training inside the Claude Code harness. That harness is forgiving: it absorbs malformed calls instead of rejecting them, so the malformed calls are reinforced and the models end up adapted to its specific schema.
- Anthropic's strict mode fixes the problem by constraining decoding to the tool schema. That a fix is needed at all shows that tool schemas are not neutral: the dominant training environment may shape which schemas a model handles well.
- The behavior is context-dependent: it is reproducible in agentic histories (e.g., from user transcripts), and the invented keys look random, such as requireUnique, oldText2, or type.
- This is a regression from models like Opus 4.5, which adapted well to alternative tool shapes, and it raises concerns about how well newer models transfer to other harnesses.
- By comparison, OpenAI's Harmony format uses in-band signalling for JSON-constrained sampling, whereas Anthropic's closed models and harness limit transparency and control.
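
To make the strict-versus-forgiving distinction concrete, here is a minimal Python sketch. It is not Pi's or Claude Code's actual code. The allowed keys (`oldText`, `newText`), the `path` field, and both function names are assumptions chosen for illustration; only the invented key `requireUnique` comes from the reports above.

```python
from typing import Any

# Hypothetical stand-in for the edit tool's schema: each entry in `edits`
# may carry exactly these keys. The real Pi schema may differ.
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def validate_strict(call: dict[str, Any]) -> list[str]:
    """Return schema violations; an empty list means the call is accepted."""
    errors = []
    for i, edit in enumerate(call.get("edits", [])):
        extra = sorted(set(edit) - ALLOWED_EDIT_KEYS)
        missing = sorted(ALLOWED_EDIT_KEYS - set(edit))
        if extra:
            errors.append(f"edits[{i}]: unexpected keys {extra}")
        if missing:
            errors.append(f"edits[{i}]: missing keys {missing}")
    return errors


def absorb_forgiving(call: dict[str, Any]) -> dict[str, Any]:
    """Silently drop unknown keys, the way a lenient harness might."""
    cleaned = [
        {k: v for k, v in edit.items() if k in ALLOWED_EDIT_KEYS}
        for edit in call.get("edits", [])
    ]
    return {**call, "edits": cleaned}


# A call with one invented field, as described in the summary above.
malformed_call = {
    "path": "src/app.py",  # hypothetical field, for illustration only
    "edits": [
        {"oldText": "foo", "newText": "bar", "requireUnique": True},
    ],
}

if __name__ == "__main__":
    # A strict harness rejects the call and reports the invented key.
    print(validate_strict(malformed_call))
    # A forgiving harness strips the key first, so the call passes.
    print(validate_strict(absorb_forgiving(malformed_call)))
```

In the forgiving path the model never sees an error, so nothing during training pushes back on the invented key. That is the feedback gap the summary points to.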