Hasty Briefs


TurboQuant: A First-Principles Walkthrough

7 hours ago
  • #Machine learning efficiency
  • #AI vector compression
  • #Quantization
  • TurboQuant compresses AI vectors (KV caches, embeddings, attention keys) to 2–4 bits per coordinate without losing accuracy.
  • It applies a random rotation so that every coordinate of the transformed vector follows the same known, fixed distribution (a Beta law that is approximately Gaussian in high dimension), enabling a single reusable codebook.
  • The Lloyd–Max algorithm precomputes an optimal codebook for this fixed distribution once, eliminating per-block metadata and per-dataset calibration.
  • The MSE-optimal variant minimizes reconstruction error but yields biased inner-product estimates, systematically shrinking them toward zero.
  • TurboQuant-prod combines MSE quantization with a QJL residual step to produce unbiased inner-product estimates.
  • Its quantization distortion decays exponentially in the bit budget b, on the order of 4^{-b}, outperforming methods whose distortion decays only polynomially.
  • TurboQuant matches full-precision performance in LLM inference (KV cache compression) and is orders of magnitude faster in nearest-neighbor search.
  • The construction is data-oblivious, requires no per-vector side information beyond one scalar norm, and runs efficiently on GPUs.
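The rotate-then-quantize pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a dense QR-based random rotation (the paper likely uses a faster structured transform), fits the Lloyd–Max codebook by running Lloyd iterations on Gaussian samples, and stores only the 2-bit codes plus one scalar norm per vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Random rotation: orthogonal Q from the QR decomposition of a Gaussian
# matrix (a common construction; the paper's exact rotation may differ).
A = rng.normal(size=(d, d))
Q, _ = np.linalg.qr(A)

def lloyd_max_gaussian(b, n=200_000, iters=50):
    """Fit a 2^b-level Lloyd-Max codebook for N(0,1) from samples."""
    samples = np.sort(rng.normal(size=n))
    k = 2 ** b
    levels = np.quantile(samples, (np.arange(k) + 0.5) / k)  # quantile init
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2         # nearest-level cells
        idx = np.searchsorted(edges, samples)
        levels = np.array([samples[idx == j].mean() for j in range(k)])
    return levels

# Precomputed once, reused for every vector -- no per-block metadata.
levels = lloyd_max_gaussian(b=2)

def quantize(x):
    # Rotate and normalize so coordinates are approximately N(0,1),
    # then snap each coordinate to its nearest codeword. The only
    # side information kept per vector is one scalar: its norm.
    norm = np.linalg.norm(x)
    z = (Q @ x) / norm * np.sqrt(d)
    codes = np.abs(z[:, None] - levels[None, :]).argmin(axis=1)
    return codes, norm

def dequantize(codes, norm):
    z_hat = levels[codes]
    return Q.T @ (z_hat * norm / np.sqrt(d))

x = rng.normal(size=d)
codes, norm = quantize(x)
x_hat = dequantize(codes, norm)
err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

At 2 bits per coordinate the relative reconstruction error lands near the classic Lloyd–Max figure for a Gaussian source (per-coordinate MSE around 0.12), with no calibration on the data itself.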
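The shrinkage bias that motivates TurboQuant-prod can be seen with an even simpler toy quantizer. The sketch below uses a 1-bit sign quantizer, which is not the paper's construction: inner products against the quantized vectors come out shrunk by a predictable factor (2/π here), and a global rescale removes the average shrinkage. The paper's QJL residual step plays this role more rigorously, giving genuinely unbiased per-vector estimates rather than an average correction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20_000, 128

# Toy 1-bit MSE-optimal quantizer for N(0,1) coordinates:
# q(z) = sign(z) * sqrt(2/pi)  (the level that minimizes E[(z - q(z))^2]).
scale = np.sqrt(2 / np.pi)
X = rng.normal(size=(n, d))
Xq = np.sign(X) * scale

q = rng.normal(size=d)
true_ip = X @ q       # exact inner products
est_ip = Xq @ q       # inner products against quantized vectors

# Regression slope of estimated vs. true inner products:
# approaches 2/pi (~0.64), i.e. estimates are systematically shrunk.
slope = (est_ip @ true_ip) / (true_ip @ true_ip)

# Rescaling by pi/2 undoes the average shrinkage (slope returns to ~1);
# this is only an on-average fix, unlike a true unbiased estimator.
debiased = est_ip * (np.pi / 2)
slope_db = (debiased @ true_ip) / (true_ip @ true_ip)
```

The point of the demo is that an MSE-optimal codebook is not automatically inner-product-optimal: minimizing reconstruction error trades away unbiasedness, which is why a separate residual correction is worth its extra bits.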
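The exponential error decay can also be checked empirically: each extra bit given to a Lloyd–Max Gaussian quantizer cuts the per-coordinate MSE by a factor approaching 4, matching the 4^{-b} behavior, whereas a method whose error shrinks only polynomially in b would need far more bits for the same accuracy. The constants below are measured from samples, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def lloyd_max_gaussian(b, n=200_000, iters=60):
    """Fit a 2^b-level Lloyd-Max codebook for N(0,1) from samples."""
    samples = np.sort(rng.normal(size=n))
    k = 2 ** b
    levels = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2
        idx = np.searchsorted(edges, samples)
        levels = np.array([samples[idx == j].mean() for j in range(k)])
    return levels

def mse(b):
    """Empirical per-coordinate MSE of the b-bit quantizer on N(0,1)."""
    z = rng.normal(size=100_000)
    levels = lloyd_max_gaussian(b)
    zq = levels[np.abs(z[:, None] - levels[None, :]).argmin(axis=1)]
    return np.mean((z - zq) ** 2)

# Successive ratios mses[b] / mses[b+1] approach 4: one more bit per
# coordinate cuts the distortion roughly fourfold.
mses = {b: mse(b) for b in (1, 2, 3, 4)}
```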