How I Made My Vector Search Engine 16x Faster Without Changing the Algorithm
- #CPU Performance
- #Vector Search Optimization
- #Data Layout
- Achieved significant performance improvements in a vector search engine without algorithm changes: query latency reduced from 25.15ms to 1.524ms (16.5x faster), build time improved from 17.91s to 1.889s (9.48x faster).
- Optimized the data layout by replacing object-heavy structures with flat arrays and a lightweight FloatVectorView, cutting pointer chasing and virtual dispatch and enabling CPU-friendly contiguous memory access.
- Used squared distances instead of Euclidean distances to eliminate costly square root calculations; since the square root is monotonically increasing, the nearest-neighbor ordering is unchanged.
- Cached candidate scores before sorting, so each distance is computed once rather than recomputed on every comparator call.
- Assembly analysis showed the hot loop transformed from virtual function calls and scalar arithmetic into SIMD instructions such as `movups`, `subps`, and `mulps`, which operate on packed single-precision floats.
- Key insight: much of the performance was hidden in the data layout. Big-O analysis alone is insufficient; CPU efficiency depends on reducing indirection, keeping memory access contiguous and cache-friendly, and avoiding redundant work in hot paths.
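To make the flat-layout idea concrete, here is a minimal Java sketch. `FlatVectorStore` is a hypothetical name, and the internals of `FloatVectorView` are an assumption (an offset into one shared `float[]`), not the engine's actual code:

```java
// Sketch of the flat layout: all vectors live back-to-back in one float[],
// and a view is just (array, offset, dim) -- no per-vector object graph.
final class FlatVectorStore {
    final float[] data;  // every vector stored contiguously
    final int dim;

    FlatVectorStore(int count, int dim) {
        this.data = new float[count * dim];
        this.dim = dim;
    }

    void set(int i, float[] v) {
        System.arraycopy(v, 0, data, i * dim, dim);
    }

    // No copy and no allocation of per-element objects: just an index
    // into the shared storage.
    FloatVectorView view(int i) {
        return new FloatVectorView(data, i * dim, dim);
    }
}

final class FloatVectorView {
    final float[] data;
    final int offset, dim;

    FloatVectorView(float[] data, int offset, int dim) {
        this.data = data;
        this.offset = offset;
        this.dim = dim;
    }

    // Tight loop over contiguous memory -- the loop shape a JIT compiler
    // can turn into packed SSE instructions.
    float squaredDistanceTo(float[] query) {
        float sum = 0f;
        for (int d = 0; d < dim; d++) {
            float diff = data[offset + d] - query[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Because the view holds only an offset, iterating over many candidates walks one contiguous array instead of chasing a pointer per vector.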
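The squared-distance trick fits in a few lines; the `Distances` class below is illustrative, not the engine's API. Because the square root is monotonically increasing, comparing squared distances produces the same nearest-neighbor ordering while skipping the `sqrt` entirely:

```java
// Illustrative sketch: squared distance preserves nearest-neighbor
// ordering, so the square root can be dropped from the hot path.
final class Distances {
    static float squared(float[] a, float[] b) {
        float sum = 0f;
        for (int d = 0; d < a.length; d++) {
            float diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    static float euclidean(float[] a, float[] b) {
        return (float) Math.sqrt(squared(a, b));
    }

    public static void main(String[] args) {
        float[] q = {0f, 0f};
        float[] near = {1f, 1f};
        float[] far = {3f, 4f};
        // Both metrics agree on which candidate is closer.
        System.out.println(Distances.squared(q, near) < Distances.squared(q, far));      // true
        System.out.println(Distances.euclidean(q, near) < Distances.euclidean(q, far));  // true
    }
}
```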