Context Rot: How increasing input tokens impacts LLM performance
- #long-context
- #LLMs
- #benchmarking
- Recent LLMs show trends toward longer context windows, with input tokens reaching millions.
- Performance on benchmarks like Needle in a Haystack (NIAH) is often assumed to be uniform across context lengths, but NIAH is a simple lexical retrieval task.
- Extended NIAH tasks explore semantic matches and variations in haystack content, revealing performance degradation with longer inputs.
- Models struggle with non-lexical matching, distractors, and haystack structure, impacting real-world applications.
- LongMemEval benchmark tests conversational QA, showing performance drops when models must retrieve from long contexts.
- A repeated-words task demonstrates that autoregressive models lose accuracy as output length scales with input length.
- Performance degradation is non-uniform across models, with some refusing tasks or generating random outputs.
- Context engineering is crucial for reliable performance, as how information is presented affects model behavior.
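The NIAH setup referenced above can be reproduced as simple prompt construction: fix a "needle" fact, vary the haystack length and the needle's relative depth, and measure retrieval accuracy at each point. A minimal sketch, assuming an illustrative helper name (`build_niah_prompt`) and made-up filler text; the actual benchmarks use curated corpora and model calls not shown here:

```python
import random

def build_niah_prompt(needle: str, question: str,
                      filler_sentences: list[str],
                      n_filler: int, depth: float, seed: int = 0) -> str:
    """Assemble a needle-in-a-haystack prompt: n_filler distractor
    sentences drawn from filler_sentences, with the needle inserted
    at a relative depth (0.0 = start of haystack, 1.0 = end)."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(n_filler)]
    pos = int(depth * len(haystack))
    haystack.insert(pos, needle)
    return " ".join(haystack) + "\n\nQuestion: " + question

# Sweeping n_filler and depth is where context rot shows up:
# accuracy that looks flat at short lengths degrades as inputs grow.
fillers = ["The sky was clear that morning.",
           "The train left the station on time."]
prompt = build_niah_prompt("The locker code is 4912.",
                           "What is the locker code?",
                           fillers, n_filler=50, depth=0.5)
```

Swapping the lexical needle for a semantic paraphrase of the question, or adding near-miss distractors to the fillers, turns this into the harder variants where degradation is most pronounced.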