Hasty Briefs (beta)

Differences in link hallucination and source comprehension across different LLMs

a year ago
  • #AI
  • #LLMs
  • #Fact-Checking
  • The author discusses differences in link hallucination and source comprehension across large language models (LLMs), focusing on their ability to accurately cite and summarize real-world documents.
  • A real-world example is used to test the LLMs: evaluating a claim from the MAHA Report about the effectiveness of stimulants for ADHD, a claim that rests on the MTA study.
  • The MTA study initially showed benefits for stimulants at 14 months, but follow-ups at 3 years found no sustained differences between treatment groups, though methodological issues complicate interpretation.
  • Different LLMs (Claude, Gemini, ChatGPT) are tested for their ability to correctly interpret and cite the MTA study, with varying results.
  • Claude Sonnet 3.7, 4, and Opus all fail to correctly interpret the study, while ChatGPT o3 performs surprisingly well.
  • Gemini 2.5 and ChatGPT 4.1 struggle with link hallucination, providing incorrect or irrelevant sources.
  • SIFT Toolbox, a contextualization engine, is used to improve model performance, but issues with link hallucination persist in some models.
  • Claude Sonnet 4, when used with SIFT Toolbox, provides the best summary and accurate sources, with no hallucinated links.
  • The author emphasizes the need for systematic testing of hallucinations and sourcing in LLMs, noting that models with fewer hallucinated links tend to provide better answers; a minimal link-check sketch follows this list.
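
One way to make the link-hallucination half of that testing systematic is to verify that every URL a model cites actually resolves. Below is a minimal sketch in Python using the `requests` library; the function name and the sample URLs are hypothetical stand-ins, not part of the author's setup, and a resolving link is necessary but not sufficient, since a real page can still be irrelevant to the claim.

```python
import requests

def check_cited_links(urls, timeout=10.0):
    """Map each cited URL to True if it resolves without an error status."""
    results = {}
    for url in urls:
        try:
            # HEAD keeps the check cheap; some servers reject HEAD with 405,
            # so fall back to GET in that case.
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code == 405:
                resp = requests.get(url, allow_redirects=True,
                                    timeout=timeout, stream=True)
            results[url] = resp.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results

if __name__ == "__main__":
    # Hypothetical URLs standing in for a model's citations.
    cited = [
        "https://pubmed.ncbi.nlm.nih.gov/",          # resolves
        "https://example.com/fabricated-mta-paper",  # likely a dead path
    ]
    for url, ok in check_cited_links(cited).items():
        print(("ok    " if ok else "dead  ") + url)
```

Note that this only catches citations that fail to resolve; the real-but-irrelevant sources that Gemini 2.5 and ChatGPT 4.1 produced would pass this check, so a full evaluation still has to compare each page's content against the claim it is cited for.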