Hasty Briefs


KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

3 hours ago
  • #sequential compression
  • #KV cache compression
  • #probabilistic language tries
  • The paper introduces sequential KV compression, a two-layer architecture for compressing transformer key-value caches by treating them as sequences from a formal language.
  • The first layer uses probabilistic prefix deduplication based on Probabilistic Language Tries to identify semantically equivalent shared prefixes across sessions.
  • The second layer applies predictive delta coding to store residuals between new KV vectors and the model's predictions, achieving a per-token entropy bound tied to language model perplexity.
  • Theoretical compression ratios are estimated at up to 914,000x over prior methods such as TurboQuant operating at the per-vector Shannon limit, with the advantage growing as context length increases.
  • The proposed method is orthogonal to existing per-vector quantization techniques, allowing integration with methods such as TurboQuant.
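The two-layer idea above can be sketched in a toy form. This is not the paper's implementation: the real first layer uses Probabilistic Language Tries to match *semantically equivalent* prefixes, while the trie below only deduplicates exact token prefixes; and the real second layer predicts KV vectors with the language model itself, while here a hypothetical order-1 predictor (repeat the previous vector) stands in. The point is only to show where the savings come from: shared prefixes are stored once, and residuals against a good predictor are small and cheap to quantize or entropy-code.

```python
import numpy as np

class PrefixTrie:
    """Layer 1 (sketch): deduplicate shared prefixes across sessions.

    Exact-match stand-in for the paper's Probabilistic Language Tries,
    which also match semantically equivalent (not just identical) prefixes.
    """
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Insert a token sequence; return the length of the already-stored
        prefix (KV entries for those tokens need not be stored again)."""
        node, shared = self.root, 0
        for i, t in enumerate(tokens):
            if t in node:
                shared += 1
                node = node[t]
            else:
                for u in tokens[i:]:  # store only the novel suffix
                    node[u] = {}
                    node = node[u]
                break
        return shared

def delta_encode(kv_vectors, predict):
    """Layer 2 (sketch): store residuals between each KV vector and a
    prediction computed from the already-decoded prefix."""
    return [v - predict(kv_vectors[:i]) for i, v in enumerate(kv_vectors)]

def delta_decode(residuals, predict):
    """Invert delta_encode by re-running the same predictor."""
    out = []
    for r in residuals:
        out.append(predict(out) + r)
    return out

# Hypothetical predictor: repeat the last vector (zeros for the first token).
predict = lambda prefix: prefix[-1] if prefix else np.zeros(2)

trie = PrefixTrie()
trie.insert(["the", "cat", "sat"])       # first session: nothing shared
hits = trie.insert(["the", "cat", "ran"])  # second session reuses 2 tokens

vecs = [np.array([1.0, 2.0]), np.array([1.5, 2.5]), np.array([1.4, 2.4])]
res = delta_encode(vecs, predict)   # residuals are small deltas
dec = delta_decode(res, predict)    # lossless round trip
```

The residual stream is what a per-vector quantizer like TurboQuant would then compress, which is why the authors describe the layers as orthogonal to existing quantization.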