14× faster embeddings: how we rebuilt the ONNX path in Manticore


The Auto Embeddings feature in Manticore Search originally ran SentenceTransformers models on Candle, but throughput was limited to 5–11 docs/sec across various configurations.
The new ONNX Runtime backend, shipped in version 27.1.5, is ~14× faster on average, achieving 70–230 docs/sec with the same hardware and model.
Key changes include disabling intra_op_spinning to avoid CPU busy-waiting and abandoning batching within workers due to padding overhead, instead processing documents individually.
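
To make the padding argument concrete, here is a minimal sketch. It assumes the usual batching scheme in which every document in a batch is padded to the length of the longest one. The token lengths are made up for illustration and are not measurements from Manticore, and the helper names are hypothetical.

```python
# Illustrative only: token counts are invented, not measured.

def padded_tokens(lengths):
    """Tokens processed when every doc is padded to the longest doc in the batch."""
    return max(lengths) * len(lengths) if lengths else 0


def individual_tokens(lengths):
    """Tokens processed when each doc is run on its own, with no padding."""
    return sum(lengths)


def padding_overhead(lengths):
    """Fraction of the padded batch that is padding rather than real tokens."""
    padded = padded_tokens(lengths)
    return 0.0 if padded == 0 else 1 - individual_tokens(lengths) / padded


# Hypothetical token lengths for one batch of mixed-size documents.
doc_lengths = [12, 40, 7, 256, 31, 18, 95, 9]

print(padded_tokens(doc_lengths))               # 2048
print(individual_tokens(doc_lengths))           # 468
print(round(padding_overhead(doc_lengths), 3))  # 0.771
```

In this made-up batch a single long document forces the other seven to be padded to 256 tokens, so roughly three quarters of the work goes to padding. When document lengths are nearly uniform the overhead approaches zero, so how much is saved by processing documents individually depends on the length distribution of the data.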
A thread-safe shared ONNX session is used on Linux/macOS, with adaptive parallelism: single-doc INSERTs take a fast path, while bulk operations parallelize across workers.
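
The adaptive dispatch described above might look roughly like the following sketch. It is not Manticore's implementation: `FakeSession`, `embed_documents`, and the worker count of 4 are hypothetical stand-ins, and the "embedding" is a placeholder computation so the example runs with only the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeSession:
    """Hypothetical stand-in for a thread-safe inference session."""

    def run(self, text):
        # Placeholder "embedding"; a real session would run the model here.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]


# Created once and shared by all workers (assumes run() is thread-safe).
SHARED_SESSION = FakeSession()
MAX_WORKERS = 4  # assumed value, not taken from the source


def embed_documents(texts, max_workers=MAX_WORKERS):
    """Embed texts one document at a time, in parallel when there are several."""
    if not texts:
        return []
    if len(texts) == 1:
        # Fast path: a single-row INSERT skips thread-pool setup entirely.
        return [SHARED_SESSION.run(texts[0])]
    # Bulk path: spread documents across workers; each worker still
    # processes one document per call, so no padding is needed.
    workers = min(max_workers, len(texts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(SHARED_SESSION.run, texts))


single = embed_documents(["hello world"])
bulk = embed_documents([f"document {i}" for i in range(10)])
print(len(single), len(bulk))  # 1 10
```

The design point is that the session is created once and reused, while the amount of parallelism is chosen per request: one document runs inline, and larger requests fan out across a bounded number of workers.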
Performance tuning: For maximum throughput, use large batch sizes (32–128) with a single client thread, achieving up to 233 docs/sec; single-row INSERTs now reach 72 docs/sec.
The migration to ONNX requires no API changes for existing tables using ONNX-capable models; switching models involves adding a new column or dumping and reloading data.
Future improvements include GPU support, Windows performance parity, and extending the ONNX path to more model architectures like T5 and GGUF.
