Opus 4.6 hallucinates twice as much today as when it was released
5 hours ago
- #Code Analysis
- #AI Hallucination
- #Benchmark
- The benchmark tests AI models for hallucination in code analysis using 30 tasks, 6 clusters, and 175 questions, with answers verified by code execution and ground truth.
- Grok 4.20 Reasoning ranks first with a score of 91.8, 90.0% accuracy, and a 10.0% fabrication rate, the fewest false claims of any model tested.
- Fabrication rates vary widely, from 10.0% in top models to nearly 50% in lower-ranked ones like GPT-4o Mini and MiniMax M2.5.
- The ranking includes 27 AI models, with scores, accuracy, and fabrication percentages listed for comparison.
- The data was last updated on 2026-04-12 and gives a snapshot of how often each model makes false claims during code analysis.
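
The summary does not say how the accuracy and fabrication percentages are derived, so the following is only a minimal sketch under stated assumptions: each question receives one of three hypothetical verdicts (`correct`, `fabricated`, `abstained`), and each rate is that verdict's share of all questions. The verdict labels and the `summarize` function are illustrative names, not part of the benchmark, and the overall score (for example 91.8) is not reproduced here because its formula is not given.

```python
from collections import Counter


def summarize(verdicts):
    """Return (accuracy %, fabrication %) for a list of per-question verdicts.

    Assumption: verdicts are the strings "correct", "fabricated", or
    "abstained". These labels are hypothetical, not the benchmark's own.
    """
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    counts = Counter(verdicts)
    total = len(verdicts)
    accuracy = 100.0 * counts["correct"] / total
    fabrication = 100.0 * counts["fabricated"] / total
    return round(accuracy, 1), round(fabrication, 1)


# Hypothetical run: 9 answers verified correct, 1 fabricated claim.
example = ["correct"] * 9 + ["fabricated"]
print(summarize(example))  # (90.0, 10.0)
```

Under these assumptions the two rates sum to 100% only when a model never abstains, which is consistent with the 90.0% and 10.0% figures reported for the top-ranked model but is not confirmed by the source.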