Opus 4.6 hallucinates twice as much today as when it was released

5 hours ago

The benchmark tests AI models for hallucination in code analysis using 30 tasks, 6 clusters, and 175 questions, with answers verified by code execution and ground truth.
Grok 4.20 Reasoning ranks first with a score of 91.8, 90.0% accuracy, and a 10.0% fabrication rate, the fewest false claims of any model listed.
Fabrication rates vary widely, from 10.0% at the top of the ranking to nearly 50% for lower-ranked models such as GPT-4o Mini and MiniMax M2.5.
The ranking includes 27 AI models, with scores, accuracy, and fabrication percentages listed for comparison.
The data, updated as of 2026-04-12, is a snapshot of how often each model makes false claims during code analysis.
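
To make the accuracy and fabrication percentages concrete, here is a minimal sketch of how such rates could be computed from graded answers. The benchmark's actual grading scheme and score formula are not described here, so the labels "correct" and "fabricated", the function name summarize, and the sample data are all assumptions for illustration only.

```python
from collections import Counter


def summarize(grades):
    """Return (accuracy %, fabrication %) for a list of per-question grades.

    Each grade is a hypothetical label, "correct" or "fabricated".
    This is an assumed grading scheme, not the benchmark's own.
    """
    if not grades:
        raise ValueError("no graded answers")
    counts = Counter(grades)
    total = len(grades)
    accuracy = 100.0 * counts["correct"] / total
    fabrication = 100.0 * counts["fabricated"] / total
    return round(accuracy, 1), round(fabrication, 1)


# Illustrative data only, not taken from the leaderboard: 20 graded answers.
sample = ["correct"] * 18 + ["fabricated"] * 2
print(summarize(sample))  # (90.0, 10.0)
```

Under this assumed scheme, a claim that a model "hallucinates twice as much" would correspond to its fabrication percentage doubling between two snapshots of the same question set.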

Hasty Briefs (beta)