Hasty Briefs (beta)

Differences in link hallucination and source comprehension across different LLMs

a year ago
  • #AI
  • #LLMs
  • #Fact-Checking
  • The author discusses differences in link hallucination and source comprehension across large language models (LLMs), focusing on their ability to accurately cite and summarize real-world documents.
  • A real-world example is used to test the LLMs: evaluating a claim from the MAHA Report about the effectiveness of stimulants for ADHD, a claim that rests on the MTA study.
  • The MTA study initially showed benefits for stimulants at 14 months, but follow-ups at 3 years found no sustained differences between treatment groups, though methodological issues complicate interpretation.
  • Different LLMs (Claude, Gemini, ChatGPT) are tested for their ability to correctly interpret and cite the MTA study, with varying results.
  • Claude Sonnet 3.7, 4, and Opus all fail to correctly interpret the study, while ChatGPT o3 performs surprisingly well.
  • Gemini 2.5 and ChatGPT 4.1 struggle with link hallucination, providing incorrect or irrelevant sources.
  • SIFT Toolbox, a contextualization engine, is used to improve model performance, but issues with link hallucination persist in some models.
  • Claude Sonnet 4, when used with SIFT Toolbox, provides the best summary and accurate sources, with no hallucinated links.
  • The author emphasizes the need for systematic testing of hallucinations and sourcing in LLMs, noting that models with fewer hallucinated links tend to provide better answers; a minimal link-check sketch follows this list.
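
One way to make the link-hallucination half of that testing systematic is to verify that every URL a model cites actually resolves. Below is a minimal sketch in Python using the `requests` library; the function name and the sample URLs are hypothetical stand-ins, not part of the author's setup, and a resolving link is necessary but not sufficient, since a real page can still be irrelevant to the claim.

```python
import requests

def check_cited_links(urls, timeout=10.0):
    """Map each cited URL to True if it resolves without an error status."""
    results = {}
    for url in urls:
        try:
            # HEAD keeps the check cheap; some servers reject HEAD with 405,
            # so fall back to GET in that case.
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code == 405:
                resp = requests.get(url, allow_redirects=True,
                                    timeout=timeout, stream=True)
            results[url] = resp.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results

if __name__ == "__main__":
    # Hypothetical URLs standing in for a model's citations.
    cited = [
        "https://pubmed.ncbi.nlm.nih.gov/",          # resolves
        "https://example.com/fabricated-mta-paper",  # likely a dead path
    ]
    for url, ok in check_cited_links(cited).items():
        print(("ok    " if ok else "dead  ") + url)
```

Note that this only catches citations that fail to resolve; the real-but-irrelevant sources that Gemini 2.5 and ChatGPT 4.1 produced would pass this check, so a full evaluation still has to compare each page's content against the claim it is cited for.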