ANN v3: 200ms p99 query latency over 100B vectors
3 months ago
- #scalability
- #machine-learning
- #vector-search
- The pursuit of scale is not vanity; optimizing existing systems from first principles can lead to entirely new innovations.
- Deep learning's explosion over the past decade exemplifies how combining decades-old ideas with hardware advancements and specialization can yield remarkable results.
- Turbopuffer's Approximate Nearest Neighbor (ANN) Search v3 supports up to 100 billion vectors in a single search index.
- ANN v3's architecture is designed to handle 200 TiB of dense vector data at high query rates (>1k QPS) and low latency (p99 under 200ms).
- The system is bandwidth-bound: performance is limited by how quickly vector data can be fetched and processed.
- Hierarchical clustering and binary quantization are key techniques used to balance bandwidth demands and utilize cache space effectively.
- Binary quantization compresses vectors by 16-32x (one bit per dimension instead of a 16-bit or 32-bit float), significantly reducing memory bandwidth requirements and improving throughput.
- The RaBitQ quantization method preserves high recall by exploiting the mathematical properties of high-dimensional spaces.
- Distribution across storage-dense machines allows the system to scale to arbitrarily large indexes while maintaining efficiency.
- ANN v3 reaches 100-billion-vector scale at thousands of QPS with p99 latency under 200ms, making it suitable for production use.
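
The binary quantization point above can be illustrated with a small sketch. This is plain sign-based quantization, not RaBitQ and not turbopuffer's implementation: each dimension is reduced to one bit, a shortlist is ranked by Hamming distance over the compact codes, and only the shortlist is rescored against the full-precision vectors. The rescoring step, the function names (`binarize`, `hamming`, `search`, `compression_ratio`), and the `shortlist` parameter are all assumptions made for illustration.

```python
import random


def binarize(vec):
    """Pack the sign of each component into an int: bit i is set when vec[i] > 0."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of bit positions where two codes differ."""
    return bin(a ^ b).count("1")


def squared_l2(u, v):
    """Exact squared Euclidean distance on the full-precision vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def search(query, vectors, codes, top_k=1, shortlist=10):
    """Two-stage search (illustrative assumption, not the production design).

    Stage 1 scans only the 1-bit codes, which is the cheap, bandwidth-light pass.
    Stage 2 reads full-precision vectors for the shortlist only.
    """
    q_code = binarize(query)
    order = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    candidates = order[:shortlist]
    candidates.sort(key=lambda i: squared_l2(query, vectors[i]))
    return candidates[:top_k]


def compression_ratio(bits_per_component):
    """Ratio of original bits per dimension to the 1 bit kept after quantization."""
    return bits_per_component / 1


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(200)]
    codes = [binarize(v) for v in vectors]
    print(search(vectors[17], vectors, codes))
    print(compression_ratio(16), compression_ratio(32))
```

The first stage touches 1 bit per dimension rather than 16 or 32, which is where the 16-32x reduction in bytes read per vector comes from. How well a plain sign-based code preserves recall depends on the data; the source credits RaBitQ for preserving high recall, and this sketch does not attempt to reproduce that.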