LLMs are bad at returning code in JSON
9 months ago
- #LLM
- #JSON
- #Code Quality
- LLMs produce lower-quality code when asked to return it inside a structured JSON response.
- Benchmarks show models make more syntax errors when code is wrapped in JSON, largely due to mishandled quoting and escaping.
- Plain text (markdown) outperforms JSON in both code quality and problem-solving performance.
- OpenAI's 'strict' JSON mode offers no improvement over non-strict JSON for code quality.
- Models like Claude 3.5 Sonnet and DeepSeek Coder suffer the most from JSON-wrapping.
- JSON-wrapping may distract models, reducing their ability to reason about coding problems.
- OpenAI's GPT-4o shows the smallest performance drop when using JSON, but plain text still wins.
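
The quoting-and-escaping burden mentioned above can be sketched with a small example (not from the original benchmark, just an illustration). Code that reads naturally inside a markdown fence must be escaped character-by-character to survive inside a JSON string, and a single missed escape makes the entire response unparseable:

```python
import json

# Code a model might want to return: contains quotes, a backslash escape,
# and a trailing newline.
code = 'print("hello\\nworld")\n'

# In JSON, every quote, backslash, and newline must be escaped correctly.
wrapped = json.dumps({"code": code})
print(wrapped)  # {"code": "print(\"hello\\nworld\")\n"}

# Simulate a common model mistake: emitting a raw newline where the
# two-character escape sequence \n was required.
broken = wrapped.replace('\\n', '\n', 1)
try:
    json.loads(broken)
except json.JSONDecodeError as e:
    print("entire response lost to one bad escape:", e)
```

A markdown fence has no such failure mode: the code is emitted verbatim, so the model spends no capacity tracking escape state, which is consistent with the "distraction" explanation in the bullets above.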