Fast KV Compaction via Attention Matching
- #Machine Learning
- #Natural Language Processing
- #Attention Mechanisms
- Scaling language models to long contexts is bottlenecked by the size of the key-value (KV) cache.
- Current methods for managing long contexts involve summarization in token space, which can be lossy and harm performance.
- Recent work on Cartridges shows that compact KV caches in latent space can match full-context performance but require slow, expensive optimization.
- This paper introduces Attention Matching, a method for fast context compaction in latent space that preserves both the attention outputs and the attention mass of each KV head.
- The objective decomposes into simple per-head subproblems with efficient closed-form solutions, improving the trade-off between compaction time and quality.
- Results show up to 50x compaction in seconds with minimal quality loss on some datasets.
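To make the idea concrete, here is a minimal sketch of output-preserving KV compaction. This is not the paper's algorithm: the choice of compact keys (random subsampling here) and the calibration queries `Q` are illustrative assumptions. What it does show is the closed-form flavor of the value fit: once compact keys are fixed, the compact values that best reproduce the full-cache attention outputs are a least-squares solution.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention for one head.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def compact_kv(Q, K, V, m, seed=0):
    """Compress an (n, d) KV cache down to m slots.

    Keys: picked by random subsampling -- a placeholder for whatever
    key-selection scheme one prefers (the paper's method differs).
    Values: closed-form least squares so that attention outputs on the
    calibration queries Q match the full-cache outputs.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(K), size=m, replace=False)
    K_c = K[idx]                                  # compact keys (placeholder choice)
    d = Q.shape[-1]
    A = softmax(Q @ K_c.T / np.sqrt(d))           # attention weights over compact keys
    O = attention(Q, K, V)                        # full-cache attention outputs
    V_c, *_ = np.linalg.lstsq(A, O, rcond=None)   # closed-form value fit
    return K_c, V_c
```

A usage pattern: run `compact_kv` once per KV head with `m` much smaller than `n`, then serve subsequent queries against `(K_c, V_c)` instead of the full cache. The quality of the fit depends on how representative the calibration queries are of the queries seen at decode time.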