Hasty Briefs


Fast KV Compaction via Attention Matching

5 days ago
  • #Machine Learning
  • #Natural Language Processing
  • #Attention Mechanisms
  • Scaling language models to long contexts is bottlenecked by the size of the key-value (KV) cache.
  • Current methods for managing long contexts involve summarization in token space, which can be lossy and harm performance.
  • Recent work on Cartridges shows that compact KV caches in latent space can match full-context performance but require slow, expensive optimization.
  • This paper introduces Attention Matching, a method for fast context compaction in latent space that preserves attention outputs and attention mass per KV head.
  • The approach decomposes into simple subproblems with efficient closed-form solutions, improving the trade-off between compaction time and quality.
  • Results show up to 50x compaction achieved in seconds, with minimal quality loss on some datasets.
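
The core idea of matching attention outputs with a much smaller cache can be illustrated with a minimal sketch. This is not the paper's algorithm: the compacted-key selection (random subsampling), the probe queries, and all array names here are assumptions for illustration. It shows how, once compacted keys are fixed, the compacted values admit an efficient closed-form least-squares solution, in the spirit of the decomposition into simple subproblems described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_full, n_small, n_q = 16, 256, 16, 512  # 16x compaction for this toy example

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Full KV cache for one (hypothetical) KV head, plus probe queries
K = rng.normal(size=(n_full, d))
V = rng.normal(size=(n_full, d))
Q = rng.normal(size=(n_q, d))

# Target: attention outputs computed with the full cache
target = softmax(Q @ K.T / np.sqrt(d)) @ V

# Step 1 (assumption): choose compacted keys by subsampling the full keys
Kc = K[rng.choice(n_full, n_small, replace=False)]

# Step 2: closed-form least-squares fit of compacted values so that
# softmax(Q Kc^T) Vc approximates the full-cache attention outputs
B = softmax(Q @ Kc.T / np.sqrt(d))          # attention weights, (n_q, n_small)
Vc, *_ = np.linalg.lstsq(B, target, rcond=None)

approx = B @ Vc
err = np.linalg.norm(approx - target) / np.linalg.norm(target)
print(f"relative output error at {n_full // n_small}x compaction: {err:.3f}")
```

Because the least-squares step has a closed-form solution, the fit takes a single solve per head rather than the gradient-based optimization used to train Cartridges, which is the trade-off the summary highlights.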