Voxtral – Frontier open source speech understanding models

10 months ago

Introduces Voxtral, a family of frontier open-source speech understanding models.
Voice is framed as humanity's first interface and the most natural form of human-computer interaction.
Current voice systems remain limited: unreliable, proprietary, and brittle.
Voxtral models aim to bridge the gap with exceptional transcription, deep understanding, multilingual fluency, and open deployment.
Available in two sizes: 24B for production-scale and 3B for local/edge deployments, both under Apache 2.0 license.
Voxtral offers state-of-the-art accuracy and semantic understanding at less than half the price of comparable APIs.
Capabilities include long-form context (up to 30 minutes of audio for transcription, 40 minutes for understanding), built-in Q&A and summarization, multilingual support, and function-calling from voice.
Reported benchmarks show Voxtral outperforming leading models such as Whisper, GPT-4o mini, and Gemini 2.5 Flash in transcription and understanding.
Free options to try: download locally, use the API, or test on Le Chat's voice mode.
Enterprise features include private deployment, domain-specific fine-tuning, advanced context, and dedicated integration support.
Upcoming features: speaker segmentation, audio markups, word-level timestamps, non-speech audio recognition.
Live webinar on Aug 6 to showcase voice-powered agents.
Hiring for research scientists and engineers to advance voice interface technology.
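
The function-calling capability mentioned above can be pictured with a minimal sketch. It assumes the model has already turned a spoken request into a JSON tool call of the form {"name": ..., "arguments": {...}}; that format, the tool names, and the dispatch helper are all hypothetical illustrations, not Voxtral's actual API.

```python
import json

# Hypothetical tools; the names and behavior are illustrative only.
def set_timer(minutes: int) -> str:
    return f"Timer set for {minutes} minutes"

def get_weather(city: str) -> str:
    return f"Looking up weather for {city}"

# Registry mapping tool names to Python callables.
TOOLS = {"set_timer": set_timer, "get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Run the tool named in a JSON tool call.

    Assumes the (hypothetical) shape {"name": str, "arguments": dict}.
    """
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))

# Stand-in for what a model might emit after hearing
# "set a timer for ten minutes".
example = '{"name": "set_timer", "arguments": {"minutes": 10}}'
print(dispatch(example))
```

In a real integration the JSON would come from the model's response rather than a hard-coded string, and the registry would hold the application's own functions.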
