Hasty Briefs


KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

3 hours ago
  • #sequential compression
  • #KV cache compression
  • #probabilistic language tries
  • The paper introduces sequential KV compression, a two-layer architecture for compressing transformer key-value caches by treating them as sequences from a formal language.
  • The first layer uses probabilistic prefix deduplication based on Probabilistic Language Tries to identify semantically equivalent shared prefixes across sessions.
  • The second layer applies predictive delta coding to store residuals between new KV vectors and the model's predictions, achieving a per-token entropy bound tied to language model perplexity.
  • Theoretical compression ratios are estimated at up to 914,000x over prior methods such as TurboQuant operating at the per-vector Shannon limit, with the advantage growing as context length increases.
  • The proposed method is orthogonal to existing per-vector quantization techniques, allowing integration with methods such as TurboQuant.
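The two-layer idea above can be sketched in a toy form. This is not the paper's implementation: the real first layer uses Probabilistic Language Tries to match *semantically equivalent* prefixes, while the trie below only deduplicates exact token prefixes; and the real second layer predicts KV vectors with the language model itself, while here a hypothetical order-1 predictor (repeat the previous vector) stands in. The point is only to show where the savings come from: shared prefixes are stored once, and residuals against a good predictor are small and cheap to quantize or entropy-code.

```python
import numpy as np

class PrefixTrie:
    """Layer 1 (sketch): deduplicate shared prefixes across sessions.

    Exact-match stand-in for the paper's Probabilistic Language Tries,
    which also match semantically equivalent (not just identical) prefixes.
    """
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Insert a token sequence; return the length of the already-stored
        prefix (KV entries for those tokens need not be stored again)."""
        node, shared = self.root, 0
        for i, t in enumerate(tokens):
            if t in node:
                shared += 1
                node = node[t]
            else:
                for u in tokens[i:]:  # store only the novel suffix
                    node[u] = {}
                    node = node[u]
                break
        return shared

def delta_encode(kv_vectors, predict):
    """Layer 2 (sketch): store residuals between each KV vector and a
    prediction computed from the already-decoded prefix."""
    return [v - predict(kv_vectors[:i]) for i, v in enumerate(kv_vectors)]

def delta_decode(residuals, predict):
    """Invert delta_encode by re-running the same predictor."""
    out = []
    for r in residuals:
        out.append(predict(out) + r)
    return out

# Hypothetical predictor: repeat the last vector (zeros for the first token).
predict = lambda prefix: prefix[-1] if prefix else np.zeros(2)

trie = PrefixTrie()
trie.insert(["the", "cat", "sat"])       # first session: nothing shared
hits = trie.insert(["the", "cat", "ran"])  # second session reuses 2 tokens

vecs = [np.array([1.0, 2.0]), np.array([1.5, 2.5]), np.array([1.4, 2.4])]
res = delta_encode(vecs, predict)   # residuals are small deltas
dec = delta_decode(res, predict)    # lossless round trip
```

The residual stream is what a per-vector quantizer like TurboQuant would then compress, which is why the authors describe the layers as orthogonal to existing quantization.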