OpenData Vector: MIT-Licensed Vector Search on Object Storage
- #vector-search
- #object-storage
- #open-source
- OpenData Vector is an MIT-licensed vector search engine designed to run on object storage, offering a cost-effective alternative to pgvector or proprietary vector databases, with estimated costs of roughly $350/month for 100M vectors.
- It adopts a stateless, third-generation architecture that leverages object storage for durability and metadata, simplifying operation and enabling high availability without node coordination, compared to earlier tiered or disaggregated systems.
- Key architectural decisions include IVF indexing, which allows partition data to be fetched from object storage in batches; LSM-based LIRE compaction for append-only updates; and a shared-everything state model via SlateDB, which ensures consistency while minimizing direct node-to-node communication.
- Deployment flexibility ranges from embedded setups to single-node, writer-reader separation, and buffered ingest topologies, allowing users to balance cost and complexity based on availability and performance needs.
- Tradeoffs include higher warm query latency (~10ms vs. sub-millisecond for HNSW) but faster cold queries (sub-second), and write latency of up to a second due to batching; OpenData Buffer can reduce this to ~100ms, at the cost of read-your-writes guarantees.
- Benchmarks on a c6id.4xlarge node show warm query latencies in the low milliseconds for smaller datasets and the low teens of milliseconds for larger ones, with P90 cold queries under 1 second and ingestion throughput ranging from ~1K to ~12K vectors/second depending on dataset size and dimensionality.
- Future enhancements include support for smaller vector data types, quantization to improve retrieval performance, reducing S3 round trips for cold queries, and adding full-text search capabilities for combined semantic and text search.
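The IVF batching mentioned above can be sketched generically: the query probes the few nearest centroids, and each selected partition is fetched as one batched read, which maps naturally onto ranged GETs against object storage. This is an illustrative Python sketch, not OpenData Vector's actual code; `fetch_partition` stands in for a hypothetical object-storage read.

```python
import numpy as np

def ivf_search(query, centroids, fetch_partition, nprobe=8, k=10):
    """Generic IVF search sketch: pick the nprobe nearest centroids,
    fetch each selected partition in one batched read, then rank the
    candidate vectors exactly."""
    # Rank centroids by distance to the query vector.
    dists = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(dists)[:nprobe]

    # One fetch per partition; against object storage these would be
    # parallel ranged GETs rather than local reads.
    ids, vecs = [], []
    for p in probe:
        part_ids, part_vecs = fetch_partition(int(p))
        ids.extend(part_ids)
        vecs.append(part_vecs)

    # Exact re-ranking over the fetched candidates.
    vecs = np.vstack(vecs)
    cand = np.linalg.norm(vecs - query, axis=1)
    order = np.argsort(cand)[:k]
    return [(ids[i], float(cand[i])) for i in order]
```

The appeal for object storage is that the unit of I/O is a whole partition, so a query costs a handful of large sequential reads instead of the many tiny random reads a graph index like HNSW would issue.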
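The write-latency tradeoff from batching can be illustrated with a generic write buffer that flushes either when full or after a deadline. A minimal sketch under assumed semantics; the class and parameter names here are hypothetical and not from OpenData Vector or OpenData Buffer.

```python
import time

class WriteBuffer:
    """Generic ingest buffer: appends are cheap in-memory operations;
    durability (e.g. the object-storage PUT) happens on flush, which
    is why acknowledged-write latency tracks the flush interval."""

    def __init__(self, flush, max_items=1000, max_delay=1.0):
        self.flush = flush          # callback performing the durable write
        self.max_items = max_items  # flush when the batch reaches this size
        self.max_delay = max_delay  # ...or when it is this many seconds old
        self.items = []
        self.oldest = None

    def append(self, item):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.items.append(item)
        self.maybe_flush()

    def maybe_flush(self, force=False):
        expired = (self.oldest is not None
                   and time.monotonic() - self.oldest >= self.max_delay)
        if self.items and (force or expired
                           or len(self.items) >= self.max_items):
            self.flush(self.items)
            self.items, self.oldest = [], None
```

A smaller `max_delay` lowers write latency but issues more (and smaller) PUTs; that cost-versus-latency dial is what distinguishes the plain and buffered ingest topologies described above.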