Context Rot: How increasing input tokens impacts LLM performance
- #long-context
- #LLMs
- #benchmarking
- Recent LLMs show trends toward longer context windows, with input tokens reaching millions.
- Performance on benchmarks like Needle in a Haystack (NIAH) is often assumed to be uniform across context lengths, but NIAH is a simple lexical retrieval task.
- Extended NIAH tasks explore semantic matches and variations in haystack content, revealing performance degradation with longer inputs.
- Models struggle with non-lexical matching, distractors, and haystack structure, impacting real-world applications.
- LongMemEval benchmark tests conversational QA, showing performance drops when models must retrieve from long contexts.
- A repeated-words task demonstrates that autoregressive models lose accuracy as output length scales with input length.
- Performance degradation is non-uniform across models, with some refusing tasks or generating random outputs.
- Context engineering is crucial for reliable performance, as how information is presented affects model behavior.
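The NIAH setup referenced above can be reproduced as simple prompt construction: fix a "needle" fact, vary the haystack length and the needle's relative depth, and measure retrieval accuracy at each point. A minimal sketch, assuming an illustrative helper name (`build_niah_prompt`) and made-up filler text; the actual benchmarks use curated corpora and model calls not shown here:

```python
import random

def build_niah_prompt(needle: str, question: str,
                      filler_sentences: list[str],
                      n_filler: int, depth: float, seed: int = 0) -> str:
    """Assemble a needle-in-a-haystack prompt: n_filler distractor
    sentences drawn from filler_sentences, with the needle inserted
    at a relative depth (0.0 = start of haystack, 1.0 = end)."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(n_filler)]
    pos = int(depth * len(haystack))
    haystack.insert(pos, needle)
    return " ".join(haystack) + "\n\nQuestion: " + question

# Sweeping n_filler and depth is where context rot shows up:
# accuracy that looks flat at short lengths degrades as inputs grow.
fillers = ["The sky was clear that morning.",
           "The train left the station on time."]
prompt = build_niah_prompt("The locker code is 4912.",
                           "What is the locker code?",
                           fillers, n_filler=50, depth=0.5)
```

Swapping the lexical needle for a semantic paraphrase of the question, or adding near-miss distractors to the fillers, turns this into the harder variants where degradation is most pronounced.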