Hasty Briefs


AbsenceBench: Language models can't tell what's missing

10 months ago
  • #Benchmark
  • #LLMs
  • #Transformer
  • Large language models (LLMs) are good at recalling surprising inserted information but struggle to identify information that has been omitted.
  • AbsenceBench is introduced to assess LLMs' ability to detect missing information in numerical sequences, poetry, and GitHub pull requests.
  • Even state-of-the-art models like Claude-3.7-Sonnet achieve an F1 score of only 69.6% on AbsenceBench, at an average context length of just 5K tokens.
  • Transformer attention mechanisms have a fundamental limitation: they cannot easily attend to 'gaps' in documents since absences don't correspond to specific keys.
  • The study highlights how close tasks where models excel (e.g., needle-in-a-haystack retrieval, NIAH) sit to tasks where they unexpectedly fail (e.g., AbsenceBench).
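The benchmark setup described above can be sketched in a few lines: delete some items from a document, ask the model which ones are missing, and score its answer with set-level F1. This is a minimal illustration, not the paper's released code; the helper names (`make_absence_task`, `f1_score`) and the 20% omission default are assumptions.

```python
import random

def make_absence_task(lines, omit_frac=0.2, seed=0):
    """Build an AbsenceBench-style example (hypothetical helper):
    remove a fraction of lines from a document and record the
    ground-truth set of omitted lines."""
    rng = random.Random(seed)
    k = max(1, int(len(lines) * omit_frac))
    omitted = set(rng.sample(range(len(lines)), k))
    modified = [line for i, line in enumerate(lines) if i not in omitted]
    answers = [lines[i] for i in sorted(omitted)]
    return modified, answers

def f1_score(predicted, answers):
    """Set-level F1 between the model's predicted missing lines
    and the ground-truth omitted lines."""
    pred, gold = set(predicted), set(answers)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: a numerical sequence with ~30% of entries omitted.
original = [str(n) for n in range(1, 11)]
modified, answers = make_absence_task(original, omit_frac=0.3)
print(f1_score(answers, answers))  # a perfect prediction scores 1.0
```

The model sees both the original and the modified document, so the task is pure comparison; the paper's point is that attention over the modified text has no key to attach to where a line used to be, which is why scores stay low even at modest context lengths.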