KV Cache Compression 900,000x Beyond TurboQuant and the Per-Vector Shannon Limit
- #sequential compression
- #KV cache compression
- #probabilistic language tries
- The paper introduces sequential KV compression, a two-layer architecture for compressing transformer key-value caches by treating them as sequences from a formal language.
- The first layer uses probabilistic prefix deduplication based on Probabilistic Language Tries to identify semantically equivalent shared prefixes across sessions.
- The second layer applies predictive delta coding to store residuals between new KV vectors and the model's predictions, achieving a per-token entropy bound tied to language model perplexity.
- Theoretical compression ratios are estimated at up to 914,000x over prior methods such as TurboQuant operating at the per-vector Shannon limit, with the ratio improving as context length grows.
- The proposed method is orthogonal to existing per-vector quantization techniques, allowing integration with methods such as TurboQuant.
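The first layer's prefix deduplication can be sketched with a plain trie. This is a hypothetical, deterministic simplification: the paper's Probabilistic Language Tries reportedly match semantically equivalent prefixes, not just byte-identical ones, so the `insert` helper below only illustrates the basic idea of storing KV entries for a shared prefix once.

```python
# Hypothetical sketch of trie-based prefix deduplication (not the paper's
# Probabilistic Language Trie): sessions sharing a token prefix reuse one
# stored copy of the KV entries for that prefix.

class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.ref_count = 0   # how many sessions pass through this node

def insert(root, tokens):
    """Insert a session's token sequence into the trie; return how many
    leading tokens were already present (their KV entries need not be
    stored again)."""
    node = root
    shared = 0
    for i, tok in enumerate(tokens):
        if tok not in node.children:
            # Remaining suffix is new; add fresh nodes for it.
            for t in tokens[i:]:
                node.children[t] = TrieNode()
                node = node.children[t]
                node.ref_count += 1
            return shared
        node = node.children[tok]
        node.ref_count += 1
        shared += 1
    return shared

root = TrieNode()
s1 = insert(root, ["You", "are", "a", "helpful", "assistant", "."])
s2 = insert(root, ["You", "are", "a", "helpful", "coder", "."])
print(s1, s2)  # -> 0 4: the second session reuses a 4-token prefix
```

With many sessions sharing long system prompts, the deduplicated fraction grows with the length of the common prefix, which is where cross-session savings come from.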
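The second layer's predictive delta coding can be illustrated with a toy closed-loop (DPCM-style) sketch. Here a previous-vector predictor stands in for the language model's prediction, and `step` is a hypothetical quantization grid; the point is only that coding residuals against a decoder-visible prediction bounds reconstruction error while leaving small, cheaply-codable integers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy KV stream: consecutive vectors are highly correlated, so even a
# crude "predict = previous reconstruction" stand-in (the paper would use
# the model's own prediction) leaves small residuals.
d, n = 8, 64
kv = np.cumsum(rng.normal(scale=0.05, size=(n, d)), axis=0)

def encode(vectors, step=0.01):
    """Quantize each vector's residual against the running reconstruction
    to an integer grid of spacing `step` (closed-loop delta coding)."""
    pred = np.zeros(vectors.shape[1])
    codes = []
    for v in vectors:
        q = np.round((v - pred) / step).astype(np.int32)
        codes.append(q)
        pred = pred + q * step  # same reconstruction the decoder sees
    return np.stack(codes)

def decode(codes, step=0.01):
    """Rebuild vectors by accumulating the dequantized residuals."""
    return np.cumsum(codes * step, axis=0)

codes = encode(kv)
recon = decode(codes)
err = float(np.max(np.abs(recon - kv)))
print(err)  # bounded by step/2 = 0.005 per entry
```

Because the encoder quantizes against the decoder's reconstruction rather than the raw previous vector, quantization error does not accumulate across tokens; with a good predictor the residual entropy, and hence the bit rate, tracks how predictable the stream is, which is the intuition behind tying the bound to model perplexity.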