Hasty Briefs (beta)

Opus 4.6 hallucinates twice as much today as when it was released

5 hours ago
  • #Code Analysis
  • #AI Hallucination
  • #Benchmark
  • The benchmark tests AI models for hallucination in code analysis using 30 tasks, 6 clusters, and 175 questions, with answers verified by code execution and ground truth.
  • Grok 4.20 Reasoning ranks first with a score of 91.8, 90.0% accuracy, and a 10.0% fabrication rate, the fewest false claims of any model listed.
  • Fabrication rates vary widely, from 10.0% in top models to nearly 50% in lower-ranked ones like GPT-4o Mini and MiniMax M2.5.
  • The ranking includes 27 AI models, with scores, accuracy, and fabrication percentages listed for comparison.
  • The data, last updated 2026-04-12, is a snapshot of how often each model makes false claims during code analysis.
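
The summary above reports accuracy and fabrication rate per model but does not say how either figure is computed. As a minimal sketch only, assume each answer is graded with one of three labels (correct, fabricated, abstained) and that both figures are simple percentages of all questions. The labels, function names, and toy data below are hypothetical illustrations, not the benchmark's actual method or results.

```python
from collections import Counter

# Hypothetical grading labels; the source does not list the real categories.
CORRECT, FABRICATED, ABSTAINED = "correct", "fabricated", "abstained"


def summarize(labels):
    """Return (accuracy %, fabrication rate %) for one model's graded answers.

    Assumption: both figures are shares of all graded questions.
    """
    if not labels:
        raise ValueError("no graded answers")
    counts = Counter(labels)
    total = len(labels)
    accuracy = 100.0 * counts[CORRECT] / total
    fabrication = 100.0 * counts[FABRICATED] / total
    return round(accuracy, 1), round(fabrication, 1)


def rank_by_fabrication(results):
    """Sort model names by fabrication rate, lowest first."""
    return sorted(results, key=lambda name: summarize(results[name])[1])


# Made-up toy data for two placeholder models (not real benchmark results).
toy = {
    "model_a": [CORRECT] * 9 + [FABRICATED],
    "model_b": [CORRECT] * 5 + [FABRICATED] * 4 + [ABSTAINED],
}

for name in rank_by_fabrication(toy):
    acc, fab = summarize(toy[name])
    print(f"{name}: accuracy={acc}%, fabrication={fab}%")
```

Under this assumed scheme, an abstention lowers accuracy without counting as a fabrication, so the two percentages need not sum to 100.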