Hasty Briefs (beta)

Better Models: Worse Tools

4 hours ago
  • #Model regression
  • #Schema compliance
  • #LLM tool calling
  • Newer Claude models (Opus 4.8, Sonnet 5) sometimes call Pi's edit tool with extra, invented fields in the edits array. The call then violates the tool schema and is rejected. Older models did not do this.
  • The likely cause is post-training inside the Claude Code harness. That harness is forgiving and absorbs errors, so malformed calls are never penalized and may even be reinforced, leaving the models adapted to its specific schema.
  • Anthropic's strict mode fixes the problem by constraining decoding to the schema. That the fix is needed at all shows that tool schemas are not neutral: a model's behavior may be shaped by the dominant training environment.
  • The behavior is context-dependent: it reproduces in agentic histories (e.g., ones taken from user transcripts), and the invented keys look random, such as requireUnique, oldText2, or type.
  • This is a regression relative to models like Opus 4.5, which adapted well to alternative tool shapes. It raises concerns about how well newer models transfer to harnesses other than Claude Code.
  • By comparison, OpenAI's Harmony format uses in-band signalling for JSON-constrained sampling, whereas Anthropic's closed models and harness limit transparency and control.
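
The difference between a strict harness and a forgiving one can be illustrated with a minimal sketch. Everything here is hypothetical: the field names `path`, `oldText`, and `newText` are an assumed shape for an edit tool, not Pi's actual schema, and `validate_strict` and `coerce_forgiving` are illustrative helpers, not functions from Pi or Claude Code. The strict validator rejects an invented key such as requireUnique, while the forgiving variant silently drops it, so the caller never sees an error.

```python
from typing import Any

# Assumed keys for one entry of the edits array (hypothetical, for illustration only).
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def validate_strict(call: dict[str, Any]) -> list[str]:
    """Return schema errors for a tool call; unknown or missing keys are errors."""
    errors: list[str] = []
    for i, edit in enumerate(call.get("edits", [])):
        present = set(edit)
        for key in sorted(ALLOWED_EDIT_KEYS - present):
            errors.append(f"edits[{i}]: missing required key {key!r}")
        for key in sorted(present - ALLOWED_EDIT_KEYS):
            errors.append(f"edits[{i}]: unexpected key {key!r}")
    return errors


def coerce_forgiving(call: dict[str, Any]) -> dict[str, Any]:
    """Drop unknown keys from each edit instead of rejecting the call."""
    cleaned = [
        {k: v for k, v in edit.items() if k in ALLOWED_EDIT_KEYS}
        for edit in call.get("edits", [])
    ]
    return {**call, "edits": cleaned}


if __name__ == "__main__":
    # A call carrying one invented key, as described in the summary above.
    call = {
        "path": "a.py",
        "edits": [{"oldText": "x", "newText": "y", "requireUnique": True}],
    }
    print(validate_strict(call))                    # strict harness: reports the extra key
    print(validate_strict(coerce_forgiving(call)))  # forgiving harness: nothing left to report
```

A model trained only against the forgiving path gets no signal that the extra key was wrong, which is one way the reinforcement described above could arise.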