AbsenceBench: Language models can't tell what's missing
10 months ago
- #Benchmark
- #LLMs
- #Transformer
- Large language models (LLMs) are good at retrieving surprising inserted information (the Needle-in-a-Haystack setting) but struggle to identify information that has been omitted.
- AbsenceBench is introduced to assess LLMs' ability to detect missing information in numerical sequences, poetry, and GitHub pull requests.
- Even state-of-the-art models like Claude-3.7-Sonnet achieve only a 69.6% F1 score on AbsenceBench, despite a modest average context length of 5K tokens.
- Transformer attention mechanisms face a fundamental limitation: they cannot easily attend to 'gaps' in a document, because an absence does not correspond to any specific key that attention can address.
- The study highlights how narrow the gap is between tasks where models excel (e.g., NIAH) and tasks where they unexpectedly fail (e.g., AbsenceBench).
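The numerical-sequence variant of the task is easy to sketch: delete a fraction of items from a document, show the model the original and the modified copy, and score its predicted omissions with set-level F1. A minimal sketch in Python (the function names and the 20% omission rate are my own illustration, not the paper's exact protocol):

```python
import random

def make_instance(lines, omit_frac=0.2, seed=0):
    """Build an AbsenceBench-style instance: given the original document,
    return a copy with some lines deleted plus the list of deleted lines."""
    rng = random.Random(seed)
    n_omit = max(1, int(len(lines) * omit_frac))
    omitted = set(rng.sample(range(len(lines)), n_omit))
    modified = [ln for i, ln in enumerate(lines) if i not in omitted]
    answer = [lines[i] for i in sorted(omitted)]
    return modified, answer

def f1_score(predicted, answer):
    """Set-level F1 between predicted missing items and the true omissions."""
    pred, gold = set(predicted), set(answer)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: a "numerical sequence" document, one integer per line.
doc = [str(i) for i in range(1, 21)]
modified, answer = make_instance(doc, omit_frac=0.2, seed=42)
print(len(answer))                # 4 of the 20 lines were omitted
print(f1_score(answer, answer))   # a perfect prediction scores 1.0
```

Note that unlike NIAH, a correct answer here names strings that appear nowhere in the modified context, which is exactly what makes the task hard for attention-based retrieval.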