OpenData Vector: MIT-Licensed Vector Search on Object Storage
- #vector-search
- #object-storage
- #open-source
- OpenData Vector is an MIT-licensed vector search engine designed to run on object storage, offering a cost-effective alternative to pgvector or proprietary vector databases, with estimated costs of roughly $350/month for 100M vectors.
- It adopts a stateless, third-generation architecture that leverages object storage for durability and metadata, simplifying operation and enabling high availability without node coordination, compared to earlier tiered or disaggregated systems.
- Key architectural decisions include IVF indexing, which allows partition data to be fetched from object storage in batches; LSM-based LIRE compaction for append-only updates; and a shared-everything state model via SlateDB, which ensures consistency while minimizing direct node-to-node communication.
- Deployment flexibility ranges from embedded setups to single-node, writer-reader separation, and buffered ingest topologies, allowing users to balance cost and complexity based on availability and performance needs.
- Tradeoffs include higher warm query latency (~10ms vs. sub-millisecond for HNSW) but faster cold queries (sub-second), and write latency of up to a second due to batching; OpenData Buffer can reduce this to ~100ms, at the cost of read-your-writes guarantees.
- Benchmarks on a c6id.4xlarge node show warm query latencies in the low milliseconds for smaller datasets and the low teens of milliseconds for larger ones, with P90 cold queries under 1 second and ingestion throughput ranging from ~1K to ~12K vectors/second depending on dataset size and dimensionality.
- Future enhancements include support for smaller vector data types, quantization to improve retrieval performance, reducing S3 round trips for cold queries, and adding full-text search capabilities for combined semantic and text search.
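The IVF batching mentioned above can be sketched generically: the query probes the few nearest centroids, and each selected partition is fetched as one batched read, which maps naturally onto ranged GETs against object storage. This is an illustrative Python sketch, not OpenData Vector's actual code; `fetch_partition` stands in for a hypothetical object-storage read.

```python
import numpy as np

def ivf_search(query, centroids, fetch_partition, nprobe=8, k=10):
    """Generic IVF search sketch: pick the nprobe nearest centroids,
    fetch each selected partition in one batched read, then rank the
    candidate vectors exactly."""
    # Rank centroids by distance to the query vector.
    dists = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(dists)[:nprobe]

    # One fetch per partition; against object storage these would be
    # parallel ranged GETs rather than local reads.
    ids, vecs = [], []
    for p in probe:
        part_ids, part_vecs = fetch_partition(int(p))
        ids.extend(part_ids)
        vecs.append(part_vecs)

    # Exact re-ranking over the fetched candidates.
    vecs = np.vstack(vecs)
    cand = np.linalg.norm(vecs - query, axis=1)
    order = np.argsort(cand)[:k]
    return [(ids[i], float(cand[i])) for i in order]
```

The appeal for object storage is that the unit of I/O is a whole partition, so a query costs a handful of large sequential reads instead of the many tiny random reads a graph index like HNSW would issue.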
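The write-latency tradeoff from batching can be illustrated with a generic write buffer that flushes either when full or after a deadline. A minimal sketch under assumed semantics; the class and parameter names here are hypothetical and not from OpenData Vector or OpenData Buffer.

```python
import time

class WriteBuffer:
    """Generic ingest buffer: appends are cheap in-memory operations;
    durability (e.g. the object-storage PUT) happens on flush, which
    is why acknowledged-write latency tracks the flush interval."""

    def __init__(self, flush, max_items=1000, max_delay=1.0):
        self.flush = flush          # callback performing the durable write
        self.max_items = max_items  # flush when the batch reaches this size
        self.max_delay = max_delay  # ...or when it is this many seconds old
        self.items = []
        self.oldest = None

    def append(self, item):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.items.append(item)
        self.maybe_flush()

    def maybe_flush(self, force=False):
        expired = (self.oldest is not None
                   and time.monotonic() - self.oldest >= self.max_delay)
        if self.items and (force or expired
                           or len(self.items) >= self.max_items):
            self.flush(self.items)
            self.items, self.oldest = [], None
```

A smaller `max_delay` lowers write latency but issues more (and smaller) PUTs; that cost-versus-latency dial is what distinguishes the plain and buffered ingest topologies described above.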