What if AI doesn't need more RAM but better math?
- #Memory Efficiency
- #AI Compression
- #KV Cache Optimization
- TurboQuant is a two-stage algorithm from Google that compresses the KV cache in LLMs, reducing memory usage by roughly 6x while preserving accuracy.
- Stage 1, PolarQuant, converts vectors to polar coordinates, leveraging predictable angle distributions to compress efficiently without fine-tuning.
- Stage 2, QJL (Quantised Johnson-Lindenstrauss), applies a random projection before quantisation so that distances and inner products are preserved in expectation, with zero extra memory overhead.
- The KV cache is a major memory bottleneck in LLM inference: it stores the key and value vectors for every previous token, so it grows linearly with context length.
- TurboQuant is data-oblivious: it needs no calibration data or per-model tuning, so it can be dropped into any model at inference time.
- If adopted widely, this kind of compression could ease the AI memory crunch; some speculate it could even soften demand forecasts, pressuring memory stocks such as Micron and SanDisk.
- Beyond LLMs, TurboQuant may benefit vector databases, recommendation engines, fraud detection, and on-device inference by compressing high-dimensional embeddings.
- The algorithm's efficiency could enable longer contexts on edge devices, changing the economics of local AI inference.
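To make the polar-coordinate idea from stage 1 concrete, here is a minimal sketch of angle quantization, not the paper's exact scheme: split a vector into 2-D pairs, convert each pair to (radius, angle), and quantize only the angle to a few uniform bins. The bin count and the choice to keep radii in full precision are assumptions for this demo.

```python
import numpy as np

ANGLE_BITS = 4  # 16 angle bins per 2-D pair (an illustrative choice)

def polar_encode(x):
    # Interpret the vector as 2-D pairs and convert to polar form.
    pairs = x.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angles in (-pi, pi]
    # Quantize angles to ANGLE_BITS-bit uniform bins.
    bins = np.round((theta + np.pi) / (2 * np.pi) * (2**ANGLE_BITS - 1))
    return r, bins.astype(np.uint8)

def polar_decode(r, bins):
    # Reconstruct: exact radius, dequantized angle.
    theta = bins / (2**ANGLE_BITS - 1) * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float32)
x_hat = polar_decode(*polar_encode(x))
err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error: {err:.3f}")
```

Because the radius carries the magnitude exactly, all quantization error lives in the angle, which is where the predictable distribution the bullet mentions pays off.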
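The stage-2 idea of preserving inner products through 1-bit quantization can be sketched with a sign-quantized Johnson-Lindenstrauss projection. This is an illustrative reconstruction, not the paper's code: a key is stored as sign bits of a random projection plus its norm, yet its inner product with a full-precision query remains estimable without any stored quantization constants. The projection width `m` is an assumption for this demo.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 4096                    # embedding dim, projection dim (assumed)
S = rng.standard_normal((m, d))    # shared random JL matrix

k = rng.standard_normal(d)         # a "key" vector to compress
q = rng.standard_normal(d)         # a "query" vector, kept full precision

signs = np.sign(S @ k)             # stored: 1 bit per projection, plus ||k||
k_norm = np.linalg.norm(k)

# For Gaussian projections, E[sign(Sk) . (Sq)] / m = sqrt(2/pi) * <q, k/||k||>,
# so rescaling gives an unbiased estimate of the true inner product.
estimate = k_norm * np.sqrt(np.pi / 2) * (signs @ (S @ q)) / m
exact = q @ k
print(f"exact <q,k> = {exact:.2f}, 1-bit estimate = {estimate:.2f}")
```

The "zero memory overhead" claim corresponds to the fact that nothing beyond the sign bits and a single norm per vector needs to be stored: the transform itself is a shared random seed.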
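The "memory bottleneck" bullet is easy to verify with back-of-envelope arithmetic. The model shape below (32 layers, 32 KV heads, head dimension 128, fp16) is a hypothetical 7B-class configuration, not a figure from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """KV-cache size for one sequence: keys + values (hence the factor of 2)
    across every layer and head, one entry per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# fp16 baseline at a 32k-token context
baseline = kv_cache_bytes(32, 32, 128, 32_768, 2)
compressed = baseline / 6  # the reported ~6x compression ratio

print(f"fp16 KV cache: {baseline / 2**30:.1f} GiB")   # 16.0 GiB
print(f"at 6x compression: {compressed / 2**30:.1f} GiB")
```

At these assumed dimensions the fp16 cache alone is 16 GiB for a single 32k-token sequence, which is why a 6x reduction is the difference between fitting on one accelerator (or an edge device) and not.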