Expected Attention: KV Cache Compression by Estimating Attention
- #Machine Learning
- #Natural Language Processing
- #Artificial Intelligence
- Introduces Expected Attention, a training-free method for KV cache compression in large language models.
- Estimates the importance of each KV pair by predicting how much attention future queries will pay to it, exploiting the distributional properties of LLM activations (see the sketch after this list).
- Works in both the prefilling and decoding phases and outperforms state-of-the-art baselines.
- Releases KVPress, a library for implementing and benchmarking KV cache compression methods, shipping with more than 20 techniques.
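
To make the idea of estimating attention from future queries concrete, here is a minimal, self-contained sketch rather than the paper's implementation. It assumes future queries can be modeled as a Gaussian with mean `query_mean` and covariance `query_cov` (in practice these would be estimated from observed activations), scores each cached key by its expected unnormalized attention weight via the Gaussian moment-generating function, and keeps only the highest-scoring KV pairs. All function names and the toy statistics below are illustrative.

```python
import numpy as np

def expected_attention_scores(keys, query_mean, query_cov, head_dim):
    """Score each cached key by the attention a 'typical' future query would pay to it,
    assuming queries are Gaussian(query_mean, query_cov).

    For a Gaussian query q, the logit q @ k / sqrt(d) is Gaussian with
    mean mu @ k / sqrt(d) and variance k @ Sigma @ k / d, so by the Gaussian MGF:
        E[exp(q @ k / sqrt(d))] = exp(mu @ k / sqrt(d) + k @ Sigma @ k / (2 d)).
    """
    scale = np.sqrt(head_dim)
    mean_logits = keys @ query_mean / scale                        # shape: (n_keys,)
    var_logits = np.einsum("nd,de,ne->n", keys, query_cov, keys) / head_dim
    return np.exp(mean_logits + 0.5 * var_logits)

def compress_kv_cache(keys, values, query_mean, query_cov, keep_ratio=0.5):
    """Keep only the KV pairs with the highest expected attention, preserving order."""
    head_dim = keys.shape[-1]
    scores = expected_attention_scores(keys, query_mean, query_cov, head_dim)
    n_keep = max(1, int(keep_ratio * len(scores)))
    kept = np.sort(np.argsort(scores)[-n_keep:])
    return keys[kept], values[kept]

# Toy usage: 128 cached KV pairs, head dimension 64; query statistics are synthetic here.
rng = np.random.default_rng(0)
d = 64
keys = rng.standard_normal((128, d))
values = rng.standard_normal((128, d))
query_mean = rng.standard_normal(d)
query_cov = np.eye(d) * 0.1
k_small, v_small = compress_kv_cache(keys, values, query_mean, query_cov, keep_ratio=0.25)
print(k_small.shape, v_small.shape)  # (32, 64) (32, 64)
```

Ranking by expected rather than already-observed attention is what lets this kind of scoring be applied during decoding, when the queries that will read the cache have not been produced yet.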