Hasty Briefs (beta)

Speculative KV coding: losslessly compressing KV cache by up to ~4×

3 days ago
  • Introduces Speculative KV coding, a lossless compression method for the KV cache that uses a predictor model to reach up to ~4× compression.
  • The encoder and decoder each run the same cheaper predictor model in parallel, and an arithmetic coder spends fewer bits on values the predictor anticipates well.
  • KV cache compression matters because growing context sizes in LLMs make the cache expensive to store and to move.
  • Lossless compression avoids the quality degradation of lossy methods such as TurboQuant, which reduce bit-width but affect quality unpredictably.
  • The method models each scalar with a Gaussian whose mean and variance come from the predictor, so the bitrate tracks the prediction error; a three-component mixture improves the fit to the residuals.
  • Experiments with the Qwen3 model family show compression ratios of 2.37× to 2.70× for bf16 caches, with the ratio improving as model size grows.
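  • When applied to FP8 KV caches, compression ratios reach 3.08× to 3.90× over raw FP8. Since FP8 is half the width of bf16, that is roughly 6× to 8× relative to the original bf16 cache (the FP8 quantization step itself is lossy; only the coding on top of it is lossless).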
  • Future improvements include better residual models, different predictor models (e.g., smaller transformers with a linear mapping), and engineering work on throughput and on keeping the encoder-side and decoder-side predictors bit-identical.
  • Potential applications: cross-datacenter disaggregated prefill, bigger prefix caches, and host-RAM offload, trading compute for bandwidth or memory efficiency.
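
The link between prediction error and bitrate described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform quantization grid (`step`), and the sample numbers are assumptions chosen for illustration, and the three-component mixture is left out. An ideal arithmetic coder spends about -log2(p) bits on a symbol of probability p, so the sketch computes the probability mass a Gaussian prediction assigns to the bin containing the true value.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def code_length_bits(value, mu, sigma, step):
    """Ideal cost in bits of coding one quantized scalar.

    Assumption: values lie on a uniform grid with spacing `step`
    (real bf16/FP8 grids are non-uniform; this keeps the sketch simple).
    """
    lo = gaussian_cdf(value - step / 2, mu, sigma)
    hi = gaussian_cdf(value + step / 2, mu, sigma)
    p = max(hi - lo, 1e-12)  # floor keeps log2 defined far out in the tails
    return -math.log2(p)


# Hypothetical numbers: the predictor says mean 0.5 with std 0.02.
step = 2.0 ** -7
well_predicted = code_length_bits(0.5, mu=0.5, sigma=0.02, step=step)
badly_predicted = code_length_bits(0.5 + 8 * step, mu=0.5, sigma=0.02, step=step)

print(f"accurate prediction:   {well_predicted:.2f} bits")
print(f"inaccurate prediction: {badly_predicted:.2f} bits")
```

A value close to the predicted mean costs only a few bits, while one several standard deviations away costs many more, which is why the achievable ratio depends on how good the predictor is. Because the decoder can rebuild the same distribution from its own copy of the predictor, only the coded bits need to be stored or sent.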