Speculative KV coding: losslessly compressing KV cache by up to ~4×

3 days ago

Introduces Speculative KV coding, a lossless compression method for the KV cache that uses a predictor model to achieve up to ~4× compression.
Compression works by running the same cheaper predictor model on both the encode and decode sides and feeding its predictions to an arithmetic coder: the better the prediction, the fewer bits each value costs.
KV cache compression is needed because growing context sizes in LLMs make the cache expensive to store and to move.
Lossless compression avoids the quality degradation of lossy methods such as TurboQuant, which reduce bit-width but affect quality unpredictably.
The method models each scalar with a Gaussian (a prediction plus a variance), so the bitrate depends on the prediction error; a three-component mixture improves the handling of residuals.
Experiments with the Qwen3 model family show compression ratios from 2.37× to 2.70× for bf16 caches, with the ratio improving as model size grows.
When applied to FP8 KV caches, compression ratios reach 3.08× to 3.90× over raw FP8, for a total of roughly 6× to 8× relative to the original bf16 cache (FP8 already halves bf16, so 2 × 3.08 ≈ 6.2 and 2 × 3.90 = 7.8).
Future improvements include better residual models, different predictor models (e.g., smaller transformers with a linear mapping), and engineering work on throughput and on keeping the predictors bit-identical on both sides.
Potential applications include cross-datacenter disaggregated prefill, bigger prefix caches, and host-RAM offload, each trading extra compute for savings in bandwidth or memory.
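
The link between prediction quality and bitrate summarized above can be illustrated with a minimal Python sketch. It is not the post's implementation. It assumes values lie on a uniform grid of spacing `STEP` (real bf16 and FP8 values are not uniformly spaced), uses a single Gaussian rather than the three-component mixture, and computes only the ideal code length that an arithmetic coder would approach, without running an actual coder. All names (`gaussian_cdf`, `code_length_bits`, `total_bits`, `STEP`) are hypothetical.

```python
import math


def gaussian_cdf(x, mean, std):
    """CDF of the normal distribution N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def code_length_bits(value, mean, std, step):
    """Ideal code length, in bits, of one grid value under a Gaussian prediction.

    The probability of the symbol is the Gaussian mass that falls in the
    grid cell [value - step/2, value + step/2]; the ideal cost is -log2 of it.
    """
    half = step / 2.0
    p = gaussian_cdf(value + half, mean, std) - gaussian_cdf(value - half, mean, std)
    # Crude guard against log2(0) far out in the tails (an assumption of this
    # sketch; a real coder needs a properly normalized fallback distribution).
    p = max(p, 1e-12)
    return -math.log2(p)


def total_bits(values, means, stds, step):
    """Sum of ideal per-scalar code lengths over a sequence of values."""
    return sum(code_length_bits(v, m, s, step) for v, m, s in zip(values, means, stds))


STEP = 1.0 / 64.0  # hypothetical grid spacing
value = 0.5

# Accurate prediction with a small variance: cheap to encode.
sharp_and_right = code_length_bits(value, mean=0.5, std=0.02, step=STEP)
# Accurate mean but a large variance: probability mass is spread thin.
vague_but_right = code_length_bits(value, mean=0.5, std=0.5, step=STEP)
# Confident prediction that misses the true value: costs extra bits.
sharp_but_off = code_length_bits(value, mean=0.45, std=0.02, step=STEP)

print(f"sharp and right: {sharp_and_right:.2f} bits")
print(f"vague but right: {vague_but_right:.2f} bits")
print(f"sharp but off:   {sharp_but_off:.2f} bits")
```

Because the decoder runs the same predictor, it can rebuild the same probabilities and recover each value exactly, which is why this kind of scheme depends on the predictor producing bit-identical outputs on both sides.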
