AbsenceBench: Language models can't tell what's missing
10 months ago
- #Benchmark
- #LLMs
- #Transformer
- Large language models (LLMs) are good at retrieving surprising inserted information (the Needle-in-a-Haystack setting) but struggle to identify information that has been omitted.
- AbsenceBench is introduced to assess LLMs' ability to detect missing information in numerical sequences, poetry, and GitHub pull requests.
- Even state-of-the-art models like Claude-3.7-Sonnet achieve only a 69.6% F1 score on AbsenceBench, despite a modest average context length of 5K tokens.
- Transformer attention mechanisms face a fundamental limitation: they cannot easily attend to 'gaps' in a document, because an absence does not correspond to any specific key that attention can address.
- The study highlights how narrow the gap is between tasks where models excel (e.g., NIAH) and tasks where they unexpectedly fail (e.g., AbsenceBench).
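The numerical-sequence variant of the task is easy to sketch: delete a fraction of items from a document, show the model the original and the modified copy, and score its predicted omissions with set-level F1. A minimal sketch in Python (the function names and the 20% omission rate are my own illustration, not the paper's exact protocol):

```python
import random

def make_instance(lines, omit_frac=0.2, seed=0):
    """Build an AbsenceBench-style instance: given the original document,
    return a copy with some lines deleted plus the list of deleted lines."""
    rng = random.Random(seed)
    n_omit = max(1, int(len(lines) * omit_frac))
    omitted = set(rng.sample(range(len(lines)), n_omit))
    modified = [ln for i, ln in enumerate(lines) if i not in omitted]
    answer = [lines[i] for i in sorted(omitted)]
    return modified, answer

def f1_score(predicted, answer):
    """Set-level F1 between predicted missing items and the true omissions."""
    pred, gold = set(predicted), set(answer)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: a "numerical sequence" document, one integer per line.
doc = [str(i) for i in range(1, 21)]
modified, answer = make_instance(doc, omit_frac=0.2, seed=42)
print(len(answer))                # 4 of the 20 lines were omitted
print(f1_score(answer, answer))   # a perfect prediction scores 1.0
```

Note that unlike NIAH, a correct answer here names strings that appear nowhere in the modified context, which is exactly what makes the task hard for attention-based retrieval.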