14× faster embeddings: how we rebuilt the ONNX path in Manticore

4 hours ago
  • #embedding-models
  • #inference-optimization
  • #database-performance
  • The Auto Embeddings feature in Manticore Search originally ran SentenceTransformers models on Candle, but throughput was limited to 5–11 docs/sec across various configurations.
  • The new ONNX Runtime backend, shipped in version 27.1.5, is ~14× faster on average, achieving 70–230 docs/sec with the same hardware and model.
  • Key changes include disabling intra_op_spinning to avoid CPU busy-waiting and abandoning batching within workers due to padding overhead, instead processing documents individually.
  • A thread-safe shared ONNX session is used on Linux/macOS, with adaptive parallelism: single-doc INSERTs take a fast path, while bulk operations parallelize across workers.
  • Performance tuning: for maximum throughput, use large batch sizes (32–128) with a single client thread, which reaches up to 233 docs/sec; single-row INSERTs now reach 72 docs/sec.
  • The migration to ONNX requires no API changes for existing tables using ONNX-capable models; switching models involves adding a new column or dumping and reloading data.
  • Future improvements include GPU support, Windows performance parity, and extending the ONNX path to more model architectures and formats, such as T5 and GGUF.
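
The padding overhead cited above as the reason for dropping in-worker batching can be illustrated with a small calculation. This is a minimal sketch, not Manticore's code: the function names and the token counts are hypothetical, and it assumes a batch is padded to its longest document and that cost scales roughly with the number of token positions processed.

```python
def padded_token_count(lengths):
    """Token positions processed when every doc is padded to the longest in the batch."""
    return max(lengths) * len(lengths) if lengths else 0


def unpadded_token_count(lengths):
    """Token positions processed when each doc is run on its own, with no padding."""
    return sum(lengths)


def padding_overhead(lengths):
    """Fraction of processed token positions in a padded batch that are pure padding."""
    padded = padded_token_count(lengths)
    if padded == 0:
        return 0.0
    return 1 - unpadded_token_count(lengths) / padded


# Hypothetical token counts for four documents of very different lengths.
lengths = [12, 40, 7, 256]
print(padded_token_count(lengths))    # 1024 positions when batched with padding
print(unpadded_token_count(lengths))  # 315 positions when processed individually
print(round(padding_overhead(lengths), 3))
```

When document lengths vary widely, most of the padded batch is wasted work, which is consistent with the brief's point that processing documents individually inside each worker was faster. For documents of near-uniform length the overhead approaches zero, so the trade-off depends on the data.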