14× faster embeddings: how we rebuilt the ONNX path in Manticore
4 hours ago
- #embedding-models
- #inference-optimization
- #database-performance
- The Auto Embeddings feature in Manticore Search originally ran SentenceTransformers models on Candle, which limited throughput to 5–11 docs/sec across the configurations tested.
- The new ONNX Runtime backend, shipped in version 27.1.5, is ~14× faster on average, achieving 70–230 docs/sec with the same hardware and model.
- Key changes include disabling intra_op_spinning so that idle threads no longer busy-wait on the CPU, and abandoning batching within workers because of padding overhead; each worker now processes documents one at a time.
- A thread-safe shared ONNX session is used on Linux/macOS, with adaptive parallelism: single-doc INSERTs take a fast path, while bulk operations parallelize across workers.
- Performance tuning: for maximum throughput, send large batches (32–128 documents per INSERT) from a single client thread, which reaches up to 233 docs/sec; single-row INSERTs now reach 72 docs/sec.
- The migration to ONNX requires no API changes for existing tables using ONNX-capable models; switching models involves adding a new column or dumping and reloading data.
- Planned improvements include GPU support, Windows performance parity, and extending the ONNX path to more model types, such as T5 and GGUF.
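
To make the padding-overhead point concrete, here is a minimal arithmetic sketch. It assumes a batch is padded to the length of its longest document, which is the usual way transformer batches are built; the token counts and the helper names `padded_tokens` and `padding_waste` are invented for illustration and are not taken from Manticore.

```python
def padded_tokens(lengths):
    """Tokens the model processes when a batch is padded to its longest member."""
    return max(lengths) * len(lengths) if lengths else 0


def padding_waste(lengths):
    """Fraction of the processed tokens that are padding rather than content."""
    total = padded_tokens(lengths)
    return 0.0 if total == 0 else 1 - sum(lengths) / total


# Hypothetical token counts for one bulk INSERT: mostly short docs, one long one.
lengths = [12, 15, 9, 20, 256, 18, 11, 14]

print("padded batch:", padded_tokens(lengths), "tokens")
print("one at a time:", sum(lengths), "tokens")
print("wasted on padding:", round(padding_waste(lengths), 3))
```

With mixed document lengths, a single long document can make the padded batch several times larger than the sum of the individual inputs, which is one plausible reason per-document processing can win on CPU.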
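
The adaptive parallelism described above (a fast path for single documents, fan-out across workers for bulk operations) might look roughly like the sketch below. This is not Manticore's implementation: `embed_one` is a stand-in for an inference call on a shared, thread-safe session, and `embed_docs` and `max_workers` are hypothetical names chosen for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def embed_one(text):
    """Stand-in for one inference call on a shared, thread-safe session.

    Returns a small deterministic vector so the sketch runs without a model.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]


def embed_docs(docs, max_workers=4):
    """Embed one document inline; fan several out, one document per task."""
    if not docs:
        return []
    if len(docs) == 1:
        # Fast path: a single-row INSERT skips the thread pool entirely.
        return [embed_one(docs[0])]
    workers = min(max_workers, len(docs))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so vectors line up with docs.
        return list(pool.map(embed_one, docs))


print(len(embed_docs(["single row"])))
print(len(embed_docs(["doc a", "doc b", "doc c"])))
```

Each task handles exactly one document, so no padding is needed, and all workers would share the same session rather than loading the model once per thread.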