Five frontier LLMs disagree on 67% of 1k real-world fact-check claims
7 hours ago
- #LLM disagreement
- #AI evaluation
- #fact-checking
- The five frontier LLMs tested (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro) disagree on 67% of real-world, user-submitted fact-check claims.
- On 34% of claims the disagreement is substantive: the models' verdicts differ by two or more categories on the four-point scale (True / Mostly True / Misleading / False).
- Models show different behavioral priors: some concentrate their verdicts at the True/False poles, while others spread them more across the middle categories.
- Inter-rater reliability (Krippendorff's α) is 0.639, indicating structured but limited agreement; unanimous verdicts are mostly True or False, rarely the nuanced categories.
- The study uses 1,000 recent, real user-submitted claims from a fact-checking platform, chosen to avoid benchmark contamination. Models answer forced-choice prompts, and there is no ground truth to compare their verdicts against.
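
The three headline numbers (any disagreement, disagreement of two or more categories, Krippendorff's α) can be sketched on toy data as follows. This is a minimal illustration, not the study's code: the verdict rows are made up, the function names (`spread`, `disagreement_rates`, `krippendorff_alpha`) are hypothetical, and the summary does not say which distance metric the study used for α, so both a nominal and an interval metric are offered here as assumptions. The sketch also assumes every model returns a verdict for every claim.

```python
from collections import Counter
from itertools import permutations

# Verdict scale from the summary, ordered from True to False.
LABELS = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(LABELS)}


def spread(verdicts):
    """Number of categories between the two most distant verdicts on one claim."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def disagreement_rates(rows):
    """Return (share of claims with any disagreement, share with a spread of 2+)."""
    any_gap = sum(spread(r) >= 1 for r in rows) / len(rows)
    wide_gap = sum(spread(r) >= 2 for r in rows) / len(rows)
    return any_gap, wide_gap


def nominal(c, k):
    """Squared distance that only asks whether two categories differ."""
    return 0.0 if c == k else 1.0


def interval(c, k):
    """Squared distance that treats the four categories as equally spaced."""
    return float((c - k) ** 2)


def krippendorff_alpha(rows, delta2=nominal):
    """Krippendorff's alpha = 1 - D_o / D_e, built from the coincidence matrix."""
    coincidences = Counter()
    for row in rows:
        m = len(row)
        if m < 2:
            continue  # a claim rated by a single model carries no pair information
        # Every ordered pair of verdicts within a claim counts 1 / (m - 1).
        for a, b in permutations(row, 2):
            coincidences[(RANK[a], RANK[b])] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w * delta2(c, k) for (c, k), w in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * delta2(c, k) for c in totals for k in totals
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # alpha is undefined when only one category is ever used
    return 1.0 - observed / expected


# Made-up verdicts: one row per claim, one entry per model.
toy_rows = [
    ["True", "True", "True", "True", "True"],
    ["False", "False", "False", "False", "False"],
    ["True", "Mostly True", "True", "True", "Mostly True"],
    ["Mostly True", "Misleading", "False", "Misleading", "True"],
]

print(disagreement_rates(toy_rows))  # (0.5, 0.25)
print(krippendorff_alpha(toy_rows, nominal))
print(krippendorff_alpha(toy_rows, interval))
```

The choice of metric matters for a scale like this one: the nominal metric penalizes a True vs. Mostly True split as much as a True vs. False split, while the interval metric weights the latter more heavily, so the two can give noticeably different α values on the same verdicts.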