Fast KV Compaction via Attention Matching
- #Machine Learning
- #Natural Language Processing
- #Attention Mechanisms
- Scaling language models to long contexts is bottlenecked by the size of the key-value (KV) cache.
- Current methods for managing long contexts involve summarization in token space, which can be lossy and harm performance.
- Recent work on Cartridges shows that compact KV caches in latent space can match full-context performance but require slow, expensive optimization.
- This paper introduces Attention Matching, a method for fast context compaction in latent space that preserves both the attention outputs and the attention mass of each KV head.
- The objective decomposes into simple per-head subproblems with efficient closed-form solutions, improving the trade-off between compaction time and quality.
- Results show up to 50x compaction in seconds with minimal quality loss on some datasets.
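To make the idea concrete, here is a minimal sketch of output-preserving KV compaction. This is not the paper's algorithm: the choice of compact keys (random subsampling here) and the calibration queries `Q` are illustrative assumptions. What it does show is the closed-form flavor of the value fit: once compact keys are fixed, the compact values that best reproduce the full-cache attention outputs are a least-squares solution.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention for one head.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def compact_kv(Q, K, V, m, seed=0):
    """Compress an (n, d) KV cache down to m slots.

    Keys: picked by random subsampling -- a placeholder for whatever
    key-selection scheme one prefers (the paper's method differs).
    Values: closed-form least squares so that attention outputs on the
    calibration queries Q match the full-cache outputs.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(K), size=m, replace=False)
    K_c = K[idx]                                  # compact keys (placeholder choice)
    d = Q.shape[-1]
    A = softmax(Q @ K_c.T / np.sqrt(d))           # attention weights over compact keys
    O = attention(Q, K, V)                        # full-cache attention outputs
    V_c, *_ = np.linalg.lstsq(A, O, rcond=None)   # closed-form value fit
    return K_c, V_c
```

A usage pattern: run `compact_kv` once per KV head with `m` much smaller than `n`, then serve subsequent queries against `(K_c, V_c)` instead of the full cache. The quality of the fit depends on how representative the calibration queries are of the queries seen at decode time.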