
Expected Attention: KV Cache Compression by Estimating Attention

  • #Machine Learning
  • #Natural Language Processing
  • #Artificial Intelligence
  • Introduces Expected Attention, a training-free method for KV cache compression in large language models.
  • Estimates the importance of each KV pair by predicting how much attention future queries will pay to it, exploiting the distributional properties of LLM activations (see the scoring sketch after this list).
  • Operates in both the prefilling and decoding phases and outperforms state-of-the-art baselines.
  • Releases KVPress, a library for implementing and benchmarking KV cache compression methods, with more than 20 techniques included (see the usage sketch after this list).
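
The core idea can be sketched as follows: if future queries are modeled as Gaussian with mean μ and covariance Σ estimated from observed activations, the attention a cached key will receive in expectation has a closed form via the Gaussian moment-generating function: for q ~ N(μ, Σ), E[exp(q·k/√d)] = exp(k·μ/√d + kᵀΣk/(2d)). The snippet below is a minimal illustration of that scoring rule under those assumptions, not the paper's implementation; the function names and the omission of per-head statistics, RoPE handling, and value norms are simplifications.

```python
import torch

def expected_attention_scores(keys: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Score each cached key by the attention it is expected to receive
    from future queries, assuming q ~ N(mu, sigma).

    For a Gaussian query, E[exp(q @ k / sqrt(d))] has the closed form
    exp(k @ mu / sqrt(d) + k^T sigma k / (2 d)): the moment-generating
    function of a Gaussian evaluated at k / sqrt(d).

    keys:  (n_kv, d) cached key vectors
    mu:    (d,)      estimated mean of future queries
    sigma: (d, d)    estimated covariance of future queries
    """
    d = keys.shape[-1]
    mean_term = keys @ mu / d**0.5                        # (n_kv,)  k . mu / sqrt(d)
    var_term = ((keys @ sigma) * keys).sum(-1) / (2 * d)  # (n_kv,)  k^T sigma k / (2d)
    return torch.exp(mean_term + var_term)

def compress_kv(keys, values, mu, sigma, keep_ratio=0.5):
    """Keep only the KV pairs with the highest expected attention."""
    scores = expected_attention_scores(keys, mu, sigma)
    n_keep = max(1, int(keep_ratio * keys.shape[0]))
    idx = scores.topk(n_keep).indices.sort().values  # preserve positional order
    return keys[idx], values[idx]
```

Because the score depends only on cached keys and estimated query statistics, no future tokens are needed, which is what makes the method training-free and applicable during both prefilling and decoding.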
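For the released library, the sketch below follows the usage pattern shown in the KVPress README; the pipeline name, press class, and parameters are assumptions that may differ across versions, so check the repository before relying on them.

```python
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # press class name as documented in the KVPress README

# Assumed pipeline registration and model choice; adjust to your setup.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda:0",
)

# Prune the KV cache to half its size using expected-attention scores.
press = ExpectedAttentionPress(compression_ratio=0.5)

context = "A long document whose KV cache we want to compress ..."
question = "What is the document about?"
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```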