TurboQuant: A First-Principles Walkthrough
- #Machine learning efficiency
- #AI vector compression
- #Quantization
- TurboQuant compresses AI vectors (KV caches, embeddings, attention keys) to 2–4 bits per coordinate with negligible accuracy loss.
- It first applies a random rotation so that every coordinate of the rotated vector follows the same known, fixed distribution (Beta for unit vectors, approximately Gaussian in high dimension), letting a single precomputed codebook be reused for all inputs (see the rotation sketch after this list).
- The Lloyd–Max algorithm computes the MSE-optimal codebook for this fixed distribution once, offline, eliminating per-block metadata and per-dataset calibration (a sample-based sketch follows the list).
- The MSE-optimal variant minimizes reconstruction error but systematically shrinks inner-product estimates toward zero, biasing them low (demonstrated numerically below).
- TurboQuant-prod combines MSE quantization with a Quantized Johnson–Lindenstrauss (QJL) residual step to produce unbiased inner-product estimates (see the QJL sketch below).
- Its distortion decays exponentially in the bit budget $b$, at rate $O(4^{-b})$, whereas comparable methods achieve only polynomial decay (see the rate-distortion note after this list).
- TurboQuant matches full-precision performance in LLM inference (KV cache compression) and is orders of magnitude faster in nearest-neighbor search.
- The construction is data-oblivious, requires no per-vector side information beyond one scalar norm, and runs efficiently on GPUs.
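A minimal sketch of the rotation step, assuming a dense random orthogonal matrix in place of the fast structured transform a real implementation would use (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Sample a random orthogonal matrix via QR of a Gaussian matrix.
# (Dense QR is O(d^3) and only for illustration; practical systems use a
# fast randomized Hadamard transform that applies in O(d log d).)
A = rng.standard_normal((d, d))
Q, _ = np.linalg.qr(A)

# After rotation, a fixed unit vector looks like a uniformly random point on
# the sphere: each coordinate is Beta-distributed, approximately N(0, 1/d).
x = np.zeros(d)
x[0] = 1.0                       # a worst-case "spiky" input vector
y = Q @ x

print(y.std(), 1 / np.sqrt(d))   # empirical spread matches 1/sqrt(d)
```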
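The codebook step can be sketched as sample-based Lloyd–Max (one-dimensional k-means) on the standard Gaussian; `lloyd_max_gaussian` and the sample/iteration counts are illustrative choices, not the paper's code:

```python
import numpy as np

def lloyd_max_gaussian(b: int, n: int = 200_000, iters: int = 100, seed: int = 0):
    """Approximate the MSE-optimal b-bit scalar codebook for N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.standard_normal(n))
    k = 2 ** b
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial codewords
    for _ in range(iters):
        edges = (centers[:-1] + centers[1:]) / 2         # nearest-codeword cell boundaries
        idx = np.searchsorted(edges, x)                  # assign samples to cells
        centers = np.array([x[idx == j].mean() for j in range(k)])  # centroid update
    return centers

print(lloyd_max_gaussian(b=2))  # converges to roughly [-1.51, -0.45, 0.45, 1.51]
```

Because the post-rotation distribution is fixed, this runs once ahead of time; no per-dataset calibration pass is needed.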
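To make the shrinkage bias of the MSE-optimal variant concrete, the snippet below quantizes Gaussian vectors with the 2-bit codebook from above and compares inner products against the originals; the ≈0.88 factor it prints equals 1 − D, where D ≈ 0.12 is the 2-bit quantizer's distortion (again a hand-rolled illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials = 1024, 1000
codebook = np.array([-1.5104, -0.4528, 0.4528, 1.5104])  # 2-bit Lloyd-Max levels for N(0, 1)

def quantize(v):
    # Nearest-codeword scalar quantization, applied coordinate-wise.
    return codebook[np.abs(v[:, None] - codebook).argmin(axis=1)]

shrink = [quantize(x) @ x / (x @ x) for x in rng.standard_normal((trials, d))]
print(np.mean(shrink))  # ~0.88: inner products come back shrunk toward zero
```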
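The fix in TurboQuant-prod reuses the Quantized Johnson–Lindenstrauss (QJL) estimator on the residual left after MSE quantization. Below is a sketch of the bare QJL idea, shown on a raw vector with an explicit Gaussian sketch matrix (dimensions and trial counts are arbitrary): one sign bit per sketch coordinate plus a single stored norm already yields an unbiased inner-product estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, trials = 64, 512, 2000

x = rng.standard_normal(d)   # vector to compress: keep only sign(Sx) and ||x||
q = rng.standard_normal(d)   # query, available at full precision

est = []
for _ in range(trials):
    S = rng.standard_normal((m, d))  # Gaussian JL sketch
    # E[sign(g.x)(g.q)] = sqrt(2/pi) <x, q> / ||x|| for Gaussian g, so this
    # rescaling makes the sign-only estimate unbiased for <x, q>.
    est.append(np.linalg.norm(x) * np.sqrt(np.pi / 2) / m * (np.sign(S @ x) @ (S @ q)))

print(np.mean(est), x @ q)  # the average estimate matches the true inner product
```

Per the summary above, adding this unbiased residual estimate to the biased MSE estimate is what removes the shrinkage while keeping the accuracy of the MSE codebook.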
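For context on the $O(4^{-b})$ claim: this matches the Shannon rate-distortion function of a Gaussian source, which at $b$ bits per coordinate is

$$ D(b) \;=\; \sigma^{2}\, 2^{-2b} \;=\; \sigma^{2}\, 4^{-b}, $$

so exponential decay in $b$ is the information-theoretic benchmark, and any method whose distortion falls off only polynomially in $b$ is exponentially far from it.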