ANN v3: 200ms p99 query latency over 100B vectors

3 months ago

The pursuit of scale is not vanity; optimizing existing systems from first principles can lead to entirely new innovations.
Deep learning's explosion over the past decade exemplifies how combining decades-old ideas with hardware advancements and specialization can yield remarkable results.
Turbopuffer's Approximate Nearest Neighbor (ANN) Search v3 supports scales of up to 100 billion vectors in a single search index.
ANN v3's architecture is designed to handle 200TiB of dense vector data with high query rates (>1k QPS) and low latency (<200ms).
The system is bandwidth-bound, with performance limited by the ability to fetch and process large data vectors efficiently.
Hierarchical clustering and binary quantization are key techniques used to balance bandwidth demands and utilize cache space effectively.
Binary quantization compresses vectors by 16-32x, significantly reducing memory bandwidth requirements and improving throughput.
The RaBitQ quantization method preserves high recall by exploiting the mathematical properties of high-dimensional spaces.
Distribution across storage-dense machines allows the system to scale to arbitrarily large indexes while maintaining efficiency.
ANN v3 achieves 100 billion-vector scale at thousands of QPS with p99 latency under 200ms, making it suitable for production use.

Hasty Briefsbeta