Voxtral – Frontier open source speech understanding models
10 months ago
- #AI-models
- #speech-recognition
- #open-source
- Mistral AI introduces Voxtral, a family of frontier open-source speech understanding models.
- Voice as the original and most natural human-computer interface.
- Current limitations of voice systems: unreliable, proprietary, and brittle.
- Voxtral models aim to bridge the gap with exceptional transcription, deep understanding, multilingual fluency, and open deployment.
- Available in two sizes, both under the Apache 2.0 license: 24B for production-scale applications and 3B for local and edge deployments.
- Voxtral offers state-of-the-art accuracy and semantic understanding at less than half the price of comparable APIs.
- Capabilities include long-form context (a 32k-token window, covering up to about 30 minutes of audio for transcription and 40 minutes for understanding), built-in Q&A and summarization, multilingual support, and function calling from voice.
- In the published benchmarks, Voxtral outperforms leading models such as Whisper, GPT-4o mini, and Gemini 2.5 Flash on transcription and audio understanding.
- Ways to try it for free: download the weights and run them locally, call the API, or use voice mode in Le Chat.
- Enterprise features include private deployment, domain-specific fine-tuning, advanced context, and dedicated integration support.
- Upcoming features: speaker segmentation, audio markups, word-level timestamps, non-speech audio recognition.
- Live webinar on Aug 6 to showcase voice-powered agents.
- Hiring research scientists and engineers to advance voice interface technology.
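
The long-form context limit noted above means that recordings longer than a single request can hold need to be split before transcription. A minimal sketch of that planning step follows. It is not part of Voxtral or any Mistral SDK: the function name `split_into_windows`, the 30-minute default (the lower figure quoted above), and the 5-second overlap are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def split_into_windows(
    total_seconds: float,
    max_window_seconds: float = 30 * 60,  # assumed per-request limit (30 min)
    overlap_seconds: float = 5.0,         # assumed overlap so words at a cut are not lost
) -> List[Tuple[float, float]]:
    """Plan (start, end) windows, in seconds, that cover a long recording.

    Hypothetical helper: it only computes time ranges and does not read
    audio or call any API.
    """
    if total_seconds <= 0:
        return []
    if not 0 <= overlap_seconds < max_window_seconds:
        raise ValueError("overlap must be non-negative and shorter than the window")

    windows: List[Tuple[float, float]] = []
    step = max_window_seconds - overlap_seconds
    start = 0.0
    while start < total_seconds:
        end = min(start + max_window_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the last window reached the end of the recording
        start += step
    return windows


if __name__ == "__main__":
    # A 75-minute recording planned as overlapping 30-minute windows.
    for start, end in split_into_windows(75 * 60):
        print(f"{start / 60:6.2f} min -> {end / 60:6.2f} min")
```

Each window could then be cut from the recording with an audio tool of choice and sent as its own request; the overlap leaves some duplicated text at the seams, which would have to be reconciled when the transcripts are joined.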