Hasty Briefsbeta

Bilingual

ANN v3: 200ms p99 query latency over 100B vectors

3 months ago
  • #scalability
  • #machine-learning
  • #vector-search
  • The pursuit of scale is not vanity; optimizing existing systems from first principles can lead to entirely new innovations.
  • Deep learning's explosion over the past decade exemplifies how combining decades-old ideas with hardware advancements and specialization can yield remarkable results.
  • Turbopuffer's Approximate Nearest Neighbor (ANN) Search v3 supports scales of up to 100 billion vectors in a single search index.
  • ANN v3's architecture is designed to handle 200TiB of dense vector data with high query rates (>1k QPS) and low latency (<200ms).
  • The system is bandwidth-bound, with performance limited by the ability to fetch and process large data vectors efficiently.
  • Hierarchical clustering and binary quantization are key techniques used to balance bandwidth demands and utilize cache space effectively.
  • Binary quantization compresses vectors by 16-32x, significantly reducing memory bandwidth requirements and improving throughput.
  • The RaBitQ quantization method preserves high recall by exploiting the mathematical properties of high-dimensional spaces.
  • Distribution across storage-dense machines allows the system to scale to arbitrarily large indexes while maintaining efficiency.
  • ANN v3 achieves 100 billion-vector scale at thousands of QPS with p99 latency under 200ms, making it suitable for production use.