Hasty Briefs

Context Rot: How increasing input tokens impacts LLM performance

10 months ago
  • #long-context
  • #LLMs
  • #benchmarking
  • Recent LLMs trend toward longer context windows, with supported input lengths reaching millions of tokens.
  • Performance on benchmarks like Needle in a Haystack (NIAH) is often assumed to be uniform across input lengths, but NIAH is a simple lexical retrieval task.
  • Extended NIAH tasks explore semantic matches and variations in haystack content, revealing performance degradation with longer inputs.
  • Models struggle with non-lexical matching, distractors, and haystack structure, impacting real-world applications.
  • LongMemEval benchmark tests conversational QA, showing performance drops when models must retrieve from long contexts.
  • A repeated-words task shows that autoregressive models struggle to maintain accuracy even on trivial copying, as output length scales with input length.
  • Performance degradation is non-uniform across models, with some refusing tasks or generating random outputs.
  • Context engineering is crucial for reliable performance, as how information is presented affects model behavior.
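To make the NIAH setup above concrete, here is a minimal sketch of how such a benchmark constructs its inputs: a "needle" fact is embedded at a controlled depth in filler text, and the prompt length is swept upward to measure degradation. The filler text, needle wording, and the `query_model` call are all illustrative assumptions, not the benchmark's actual code.

```python
# Sketch of NIAH-style prompt construction (illustrative, not the benchmark's code).
NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
FILLER = "Paul Graham wrote many essays about startups and programming."

def build_haystack(num_filler_sentences: int, needle_depth: float) -> str:
    """Embed the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * num_filler_sentences
    idx = int(needle_depth * len(sentences))
    sentences.insert(idx, NEEDLE)
    return " ".join(sentences)

def make_prompt(haystack: str) -> str:
    return (
        f"{haystack}\n\n"
        "Question: What is the best thing to do in San Francisco?\n"
        "Answer:"
    )

# Sweep input length and needle depth, as NIAH-style benchmarks do.
for n in (10, 100, 1000):
    for depth in (0.0, 0.5, 1.0):
        prompt = make_prompt(build_haystack(n, depth))
        # answer = query_model(prompt)  # hypothetical model call
        # score by checking whether the answer contains the needle fact
```

The extended variants described above change two knobs in this harness: the needle is paraphrased so it no longer lexically matches the question, and distractor sentences (near-misses) are mixed into the filler.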
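The repeated-words task can be sketched the same way: the model is given a long run of one word with a single unique word substituted somewhere, and is asked to reproduce the text exactly, so output length grows with input length. The word choices, prompt wording, and scoring function below are illustrative assumptions.

```python
# Sketch of a repeated-words copying task (illustrative assumptions throughout).
def make_repeated_words_prompt(common: str, unique: str, length: int, position: int) -> str:
    """Build `length` copies of `common` with `unique` substituted at `position`,
    then ask the model to copy the text verbatim."""
    words = [common] * length
    words[position] = unique
    text = " ".join(words)
    return f"Replicate the following text exactly:\n{text}"

def exact_match(model_output: str, prompt: str) -> bool:
    """Score by exact reproduction of the text after the instruction line."""
    expected = prompt.split("\n", 1)[1]
    return model_output.strip() == expected

# Sweep sequence length; accuracy on this trivial task drops as inputs grow.
# for n in (25, 100, 500, 1000):
#     prompt = make_repeated_words_prompt("apple", "apples", n, n // 2)
#     # output = query_model(prompt)  # hypothetical model call
```

Because the task requires no reasoning at all, falling exact-match rates at longer lengths isolate the effect of input length itself, which is the point the summary's repeated-words bullet makes.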