Hasty Briefs (beta)

Five frontier LLMs disagree on 67% of 1k real-world fact-check claims

7 hours ago
  • #LLM disagreement
  • #AI evaluation
  • #fact-checking
  • On 67% of real-world, user-submitted fact-check claims, five frontier LLMs (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro) do not all return the same verdict.
  • 34% of claims involve substantive disagreement, where the models' verdicts sit two or more categories apart on the four-point scale (True / Mostly True / Misleading / False).
  • Models show different behavioral priors, with some concentrating verdicts at the True/False poles and others distributing more in the middle categories.
  • Inter-rater reliability (Krippendorff's α) is 0.639, indicating structured but limited agreement; unanimous verdicts are mostly True or False, rarely the nuanced categories.
  • The study uses 1,000 recent, real user-submitted claims from a fact-checking platform, chosen to rule out benchmark contamination. Prompts are forced-choice, and there is no ground truth to compare the verdicts against.
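
The three headline statistics above (share of claims with any disagreement, share with a gap of two or more categories, and Krippendorff's α) can be sketched in a few lines of Python. This is a minimal illustration, not the study's code. The function names and the toy verdicts are made up for the example. The summary does not say which distance metric the study used for α, so the sketch assumes the simplest one (nominal: labels either match or they do not) and will not reproduce the reported 0.639.

```python
from collections import Counter
from itertools import permutations

# Four-point verdict scale from the summary, mapped to ranks 0..3.
SCALE = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(SCALE)}


def spread(verdicts):
    """Gap, in categories, between the two most distant verdicts on one claim."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def disagreement_rates(claims):
    """Return (share of claims with any disagreement, share with a gap >= 2)."""
    spreads = [spread(verdicts) for verdicts in claims]
    any_rate = sum(s >= 1 for s in spreads) / len(spreads)
    substantive_rate = sum(s >= 2 for s in spreads) / len(spreads)
    return any_rate, substantive_rate


def krippendorff_alpha_nominal(claims):
    """Krippendorff's alpha with the nominal distance (assumed, see lead-in).

    Assumes every claim has a verdict from every model (no missing data)
    and that at least two distinct labels occur overall.
    """
    totals = Counter(v for verdicts in claims for v in verdicts)
    n = sum(totals.values())

    # Observed disagreement: ordered pairs of differing labels within a claim,
    # each claim weighted by 1 / (number of raters - 1).
    observed = 0.0
    for verdicts in claims:
        differing = sum(a != b for a, b in permutations(verdicts, 2))
        observed += differing / (len(verdicts) - 1)
    d_o = observed / n

    # Expected disagreement: differing pairs if all labels were paired at random.
    differing_total = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    d_e = differing_total / (n * (n - 1))
    return 1.0 - d_o / d_e


# Toy data for illustration only (NOT from the study): 4 claims x 5 models.
toy_claims = [
    ["True", "True", "True", "True", "True"],
    ["False", "False", "False", "False", "False"],
    ["True", "True", "Mostly True", "True", "True"],
    ["True", "Misleading", "False", "Mostly True", "True"],
]

any_rate, substantive_rate = disagreement_rates(toy_claims)
alpha = krippendorff_alpha_nominal(toy_claims)
print(f"any disagreement: {any_rate:.0%}, gap >= 2: {substantive_rate:.0%}")
print(f"nominal alpha: {alpha:.3f}")
```

On a four-point ordinal scale, an ordinal distance would penalize True vs. False more heavily than True vs. Mostly True. Swapping that in changes only how pairs are weighted in the observed and expected terms.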