Speculative KV coding: losslessly compressing KV cache by up to ~4×
3 days ago
- Introduces Speculative KV coding, a lossless compression method for the KV cache that uses a predictor model to achieve up to ~4× compression.
- The same cheaper predictor model runs on both the encode and decode sides, and an arithmetic coder spends fewer bits on values the predictor anticipates well.
- KV cache compression is needed because growing context sizes in LLMs make the cache costly to store and to move.
- Lossless compression avoids quality degradation issues of lossy methods like TurboQuant, which reduces bit-width but impacts quality unpredictably.
- The method predicts a Gaussian per scalar (a mean plus a variance), so the bitrate of each value depends on its prediction error; a three-component mixture improves the fit to the residuals.
- Experiments on the Qwen3 model family show compression ratios from 2.37× to 2.70× for bf16 caches, with the ratio improving as model size grows.
- Applied to FP8 KV caches, the method reaches 3.08× to 3.90× over raw FP8, which is roughly 6× to 8× relative to the original bf16 cache (FP8 is already half the size of bf16).
- Future improvements include better residual models, different predictor models (e.g., smaller transformers with linear mapping), and engineering for throughput and bit-identical predictors.
- Potential applications: cross-datacenter disaggregated prefill, bigger prefix caches, and host-RAM offload, trading compute for bandwidth or memory efficiency.
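
To make the link between prediction quality and bitrate concrete, here is a minimal sketch of how many bits an ideal arithmetic coder would spend on one scalar under a Gaussian prediction. It is an illustration under stated assumptions, not the post's implementation: the function names are hypothetical, values are assumed to sit on a uniform grid of spacing `step` (real bf16 and FP8 grids are non-uniform), and a single Gaussian stands in for the three-component mixture.

```python
import math


def gaussian_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ideal_code_length_bits(value: float, mu: float, sigma: float, step: float) -> float:
    """Bits an ideal entropy coder spends on `value`.

    Assumption: `value` lies on a uniform grid of spacing `step`, so its
    probability is the Gaussian mass of the bin [value - step/2, value + step/2].
    """
    lo = gaussian_cdf(value - step / 2.0, mu, sigma)
    hi = gaussian_cdf(value + step / 2.0, mu, sigma)
    # Floor the probability so far-tail values do not produce log2(0).
    p = max(hi - lo, 1e-300)
    return -math.log2(p)


def compression_ratio(raw_bits_per_scalar: float, coded_bits_per_scalar: float) -> float:
    """Ratio of raw storage cost to average coded cost per scalar."""
    return raw_bits_per_scalar / coded_bits_per_scalar


if __name__ == "__main__":
    step = 2.0 ** -7  # hypothetical grid spacing, chosen only for illustration
    accurate = ideal_code_length_bits(0.5, mu=0.50, sigma=0.01, step=step)
    off_target = ideal_code_length_bits(0.5, mu=0.52, sigma=0.01, step=step)
    print(f"accurate prediction:   {accurate:.2f} bits")
    print(f"off-target prediction: {off_target:.2f} bits")
    print(f"ratio vs 16-bit raw:   {compression_ratio(16.0, accurate):.2f}x")
```

Because the decoder runs the same predictor, it can rebuild the same `mu` and `sigma` and invert the coding exactly. That is why the predictor has to be bit-identical on both sides for the scheme to stay lossless.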