Show HN: We cut RAG latency ~2× by switching embedding model
- #Embeddings
- #RAG
- #Latency
- MyClone migrated its RAG pipeline from OpenAI's text-embedding-3-small (1536 dimensions) to Voyage-3.5-lite (512 dimensions), achieving 3× storage savings, 2× faster retrieval, and a 15-20% reduction in voice latency.
- Voyage-3.5-lite is trained with Matryoshka Representation Learning (MRL), which concentrates the most informative features in the leading dimensions, so embeddings can be truncated to 512d with little loss in retrieval quality.
- In practice, the switch cut the storage footprint by ~66% and halved retrieval latency.
- End-to-end voice latency improved by 15-20%, and first-token latency dropped by 15%.
- Voyage-3.5-lite supports multiple output dimensions and quantization options, leaving headroom for further optimization.
- The migration improved user experience, reduced infrastructure costs, and provided headroom for richer features.
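The truncate-and-renormalize mechanics behind MRL can be sketched in a few lines. This is a synthetic illustration with random stand-in vectors (no real model output), intended only to show how a 1536d embedding is cut to its 512d prefix while nearest-neighbor ranking is preserved; the function names and noise parameters are invented for the example.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # Both inputs are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def truncate_mrl(v, dims):
    # MRL-trained models pack the most important information into the
    # leading dimensions, so a prefix slice remains a usable embedding
    # after renormalization.
    return normalize(v[:dims])

random.seed(0)
# Stand-in for real embeddings: doc_a is deliberately close to the query,
# doc_b is unrelated.
query = normalize([random.gauss(0, 1) for _ in range(1536)])
doc_a = normalize([q + random.gauss(0, 0.3) for q in query])
doc_b = normalize([random.gauss(0, 1) for _ in range(1536)])

for dims in (1536, 512):
    q, a, b = (truncate_mrl(v, dims) for v in (query, doc_a, doc_b))
    print(dims, round(cosine(q, a), 3), round(cosine(q, b), 3))
```

At both 1536 and 512 dimensions the related document scores well above the unrelated one; with a real MRL-trained model the same prefix-slicing applies to the model's actual output.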
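The storage claim is straightforward back-of-envelope arithmetic, and the quantization headroom can be estimated the same way. A minimal sketch, counting raw vector bytes only (index overhead ignored); the corpus size is a hypothetical placeholder:

```python
FLOAT32_BYTES = 4
INT8_BYTES = 1
NUM_VECTORS = 1_000_000  # hypothetical corpus size for illustration

def vector_store_bytes(dims, bytes_per_value, n=NUM_VECTORS):
    """Raw storage for n dense vectors, ignoring index structures."""
    return dims * bytes_per_value * n

old = vector_store_bytes(1536, FLOAT32_BYTES)  # text-embedding-3-small
new = vector_store_bytes(512, FLOAT32_BYTES)   # Voyage-3.5-lite at 512d
print(f"512d float32 vs 1536d float32: {1 - new / old:.0%} smaller")   # 67% smaller

# Further headroom if the 512d vectors are later int8-quantized:
int8 = vector_store_bytes(512, INT8_BYTES)
print(f"512d int8 vs 1536d float32: {1 - int8 / old:.1%} smaller")     # 91.7% smaller
```

The first figure matches the ~66% reduction reported in the post; the second shows why the dimension/quantization flexibility leaves room for future savings.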