Audio is the one area small labs are winning
3 months ago
- #Audio
- #AI
- #Startups
- Audio AI models, particularly for voice (text-to-speech, speech-to-speech, and speech-to-text), are being developed more effectively by small, comparatively underfunded startups than by the major labs.
- Kyutai, an open audio lab, developed Moshi, the first real-time full-duplex conversational AI model: it listens and speaks simultaneously, handling interruptions and backchannels with ~160 ms latency.
- Moshi was built by a team of 4 researchers in 6 months, is open-source, and can run on mobile devices.
- Audio AI has been historically overlooked due to data scarcity, cultural biases, and the complexity of generating high-quality audio.
- Small teams outperform big labs in audio AI because they move quickly, have deep domain expertise, and carry little bureaucratic overhead.
- Kyutai's innovations include multi-stream modeling for full-duplex conversations and the Mimi neural audio codec, which compresses speech, music, and general audio effectively.
- Audio models like Moshi are far smaller and cheaper to train than frontier text models (Moshi has 7B parameters vs. 405B for the largest Llama 3.1), putting them within reach of smaller teams.
- Gradium, a spin-off from Kyutai, focuses on bringing research-grade audio models to production, raising $70M to bridge the gap between research and product.
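The 7B-vs-405B gap above is worth making concrete. A back-of-the-envelope sketch (my own arithmetic, not from the article) of weight memory alone at 8-bit precision, ignoring activations and KV cache:

```python
def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, in GiB (ignores activations, KV cache)."""
    return n_params * bytes_per_param / 2**30

# Moshi-scale 7B model vs. the largest Llama 3.1 (405B), both at 8-bit precision:
moshi_8bit = weight_gib(7e9, 1.0)     # ~6.5 GiB: plausible on a laptop or high-end phone
llama_8bit = weight_gib(405e9, 1.0)   # ~377 GiB: multi-GPU server territory
print(f"7B @ 8-bit:   {moshi_8bit:.1f} GiB")
print(f"405B @ 8-bit: {llama_8bit:.1f} GiB")
```

The two-orders-of-magnitude difference in footprint (and a similar gap in training compute) is what lets a four-person team train and ship a model like Moshi, and run it on-device.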