Neural audio codecs: how to get audio into LLMs
3 days ago
- #speech-llms
- #neural-audio-codecs
- #audio-generation
- As of October 2025, speech LLMs are not as capable as text LLMs: many still pipeline speech through transcription to text and synthesis back to audio rather than understanding speech natively.
- Neural audio codecs compress audio into short sequences of discrete tokens that an LLM can model, which makes audio modeling more coherent and efficient (see the Mimi tokenization sketch after this list).
- Training an audio LLM starts by tokenizing audio with a codec such as Mimi, which combines residual vector quantization (RVQ) with an adversarial loss for better reconstruction quality (a minimal RVQ sketch follows this list).
- Mimi, Kyutai's neural audio codec, additionally distills semantic tokens into its first codebook and applies RVQ dropout during training, both of which help downstream speech generation.
- Audio LLMs still lag behind text LLMs in reasoning and understanding, with models like Moshi showing promise but relying heavily on text streams for reasoning.
- Modern audio LLMs like CSM, Qwen3-Omni, MiMo-Audio, and LFM2-Audio use advanced codecs but still face challenges in fully native speech understanding and generation.
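To make the RVQ mechanism concrete, here is a minimal NumPy sketch of how a single frame embedding becomes a stack of discrete token ids. The sizes, random codebooks, and function names are illustrative assumptions, not Mimi's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 codebooks of 256 entries over 16-dim frame embeddings.
# Sizes and random codebooks are illustrative, not Mimi's configuration.
num_quantizers, codebook_size, dim = 4, 256, 16
codebooks = rng.normal(size=(num_quantizers, codebook_size, dim))


def rvq_encode(frame, codebooks):
    """Encode one frame embedding into a stack of discrete token ids.

    Each quantizer encodes the residual left by the previous one, so in a
    trained codec later codebooks capture progressively finer detail.
    """
    residual = frame
    token_ids = []
    for codebook in codebooks:
        # Pick the codebook entry nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        token_ids.append(idx)
        residual = residual - codebook[idx]
    return token_ids


def rvq_decode(token_ids, codebooks):
    """Reconstruct the frame embedding by summing the chosen entries."""
    return sum(codebook[i] for codebook, i in zip(codebooks, token_ids))


frame = rng.normal(size=dim)
ids = rvq_encode(frame, codebooks)
recon = rvq_decode(ids, codebooks)
print(ids)                            # one frame -> num_quantizers token ids
print(np.linalg.norm(frame - recon))  # quantization error of the toy codec
```

In a real codec the codebooks are learned so that each level shrinks the residual; RVQ (quantizer) dropout randomly truncates the number of levels used for a training example, which keeps the codec usable at lower token budgets.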
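For end-to-end tokenization, the sketch below assumes the Hugging Face transformers port of Mimi (`MimiModel`, its feature extractor, and the `kyutai/mimi` checkpoint); the method and attribute names follow transformers' Encodec-style codec API and should be checked against the installed version.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

# Assumption: the `kyutai/mimi` checkpoint and the transformers MimiModel API.
model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silence as a stand-in for real 24 kHz mono speech.
waveform = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(
    raw_audio=waveform,
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
)

with torch.no_grad():
    # Discrete codes of shape (batch, num_quantizers, frames):
    # this token grid is what an audio LLM actually models.
    codes = model.encode(inputs["input_values"]).audio_codes
    # Decode the codes back into a waveform.
    reconstruction = model.decode(codes).audio_values

print(codes.shape, reconstruction.shape)
```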