Audio is the one area small labs are winning
3 months ago
- #Audio
- #AI
- #Startups
- Audio AI models, particularly for voice (text-to-speech, speech-to-speech, and speech-to-text), are being developed more effectively by small, comparatively underfunded startups than by the major labs.
- Kyutai, an open audio lab, developed Moshi, the first real-time full-duplex conversational AI model: it listens and speaks simultaneously, handling interruptions and backchannels with ~160 ms latency.
- Moshi was built by a team of 4 researchers in 6 months, is open-source, and can run on mobile devices.
- Audio AI has been historically overlooked due to data scarcity, cultural biases, and the complexity of generating high-quality audio.
- Small teams outperform big labs in audio AI because they move quickly, have deep domain expertise, and carry little bureaucratic overhead.
- Kyutai's innovations include multi-stream modeling for full-duplex conversations and the Mimi neural audio codec, which compresses speech, music, and general audio effectively.
- Audio models like Moshi are far smaller and cheaper to train than frontier text models (Moshi has 7B parameters vs. 405B for the largest Llama 3.1), putting them within reach of smaller teams.
- Gradium, a spin-off from Kyutai, focuses on bringing research-grade audio models to production, raising $70M to bridge the gap between research and product.
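The 7B-vs-405B gap above is worth making concrete. A back-of-the-envelope sketch (my own arithmetic, not from the article) of weight memory alone at 8-bit precision, ignoring activations and KV cache:

```python
def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, in GiB (ignores activations, KV cache)."""
    return n_params * bytes_per_param / 2**30

# Moshi-scale 7B model vs. the largest Llama 3.1 (405B), both at 8-bit precision:
moshi_8bit = weight_gib(7e9, 1.0)     # ~6.5 GiB: plausible on a laptop or high-end phone
llama_8bit = weight_gib(405e9, 1.0)   # ~377 GiB: multi-GPU server territory
print(f"7B @ 8-bit:   {moshi_8bit:.1f} GiB")
print(f"405B @ 8-bit: {llama_8bit:.1f} GiB")
```

The two-orders-of-magnitude difference in footprint (and a similar gap in training compute) is what lets a four-person team train and ship a model like Moshi, and run it on-device.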