Five frontier LLMs disagree on 67% of 1k real-world fact-check claims

7 hours ago

Five frontier LLMs (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro) disagree on 67% of real-world, user-submitted fact-check claims.
34% of claims involve substantive disagreement, where the models' verdicts sit two or more categories apart on the four-point scale (True/Mostly True/Misleading/False).
Models show different behavioral priors, with some concentrating verdicts at the True/False poles and others distributing more in the middle categories.
Inter-rater reliability (Krippendorff's α) is 0.639, indicating structured but limited agreement; unanimous verdicts are mostly True or False, rarely the nuanced categories.
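The study uses 1,000 recent claims submitted by real users to a fact-checking platform, chosen to avoid benchmark contamination. Models answer forced-choice prompts, and there is no ground truth to compare against, so the results measure agreement between models rather than accuracy.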
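
The agreement figures above can be illustrated with a minimal sketch. The function names, the ordinal coding of the four categories, and the toy verdicts are all assumptions made for illustration; none of them come from the study. The summary also does not say which variant of Krippendorff's α was used, so the nominal metric here is an assumed default, and the function only handles complete data (every model rates every claim).

```python
from collections import Counter
from itertools import permutations

# Assumed ordinal coding of the four verdict categories named in the summary.
SCALE = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def disagreement_rate(verdicts):
    """Share of claims on which the models are not unanimous."""
    return sum(len(set(row)) > 1 for row in verdicts) / len(verdicts)


def substantive_rate(verdicts, min_gap=2):
    """Share of claims whose most distant verdicts are >= min_gap categories apart."""
    def spread(row):
        codes = [SCALE[v] for v in row]
        return max(codes) - min(codes)
    return sum(spread(row) >= min_gap for row in verdicts) / len(verdicts)


def krippendorff_alpha(verdicts, delta=lambda a, b: float(a != b)):
    """Krippendorff's alpha for complete data (no missing ratings).

    `delta` is the disagreement metric. The default is nominal (0 if equal,
    1 otherwise), which is an assumption, not the study's documented choice.
    Undefined (division by zero) when every rating in the data is identical.
    """
    m = len(verdicts[0])  # raters per claim
    # Coincidence matrix: each ordered pair of ratings from different raters
    # on the same claim contributes 1 / (m - 1).
    observed = Counter()
    for row in verdicts:
        for a, b in permutations(row, 2):
            observed[(a, b)] += 1 / (m - 1)
    # Marginal totals per category.
    totals = Counter()
    for (a, _), weight in observed.items():
        totals[a] += weight
    n = sum(totals.values())
    d_observed = sum(w * delta(a, b) for (a, b), w in observed.items()) / n
    d_expected = sum(
        totals[a] * totals[b] * delta(a, b) for a in totals for b in totals
    ) / (n * (n - 1))
    return 1.0 - d_observed / d_expected


# Toy data: one row per claim, one verdict per model (not the study's data).
toy = [
    ["True"] * 5,
    ["True", "True", "Mostly True", "True", "True"],
    ["True", "Misleading", "False", "Mostly True", "True"],
    ["False"] * 5,
]
print(disagreement_rate(toy), substantive_rate(toy), krippendorff_alpha(toy))
```

To treat the categories as ordered rather than nominal, one option is to pass a squared-difference metric over the assumed codes, for example `delta=lambda a, b: (SCALE[a] - SCALE[b]) ** 2`. That is the interval form of α, not the ordinal form.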
