
Neural audio codecs: how to get audio into LLMs

3 days ago
  • #speech-llms
  • #neural-audio-codecs
  • #audio-generation
  • Speech LLMs as of October 2025 are not as advanced as text LLMs: many still pipeline speech through transcription to text and back rather than understanding speech natively.
  • Neural audio codecs are used to compress audio into discrete tokens for LLMs, improving coherence and efficiency in audio modeling.
  • Training audio LLMs involves tokenizing audio with codecs like Mimi, which uses residual vector quantization (RVQ) and an adversarial loss for better quality (a minimal RVQ sketch follows this list).
  • Mimi, a neural audio codec developed by Kyutai, further improves quality by distilling semantic tokens into its first codebook and applying RVQ dropout during training (sketched after this list), both of which help downstream speech generation.
  • Audio LLMs still lag behind text LLMs in reasoning and understanding, with models like Moshi showing promise but relying heavily on text streams for reasoning.
  • Modern audio LLMs like CSM, Qwen3-Omni, MiMo-Audio, and LFM2-Audio build on such codecs but still fall short of fully native speech understanding and generation (one way codec tokens can feed an LLM is sketched below).
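
To make the RVQ step concrete, here is a minimal NumPy sketch of residual vector quantization. The function names, codebook sizes, and random codebooks are illustrative only, not Mimi's actual parameters; in a real codec the codebooks are learned, so each stage meaningfully shrinks the residual.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each codebook quantizes the
    residual left over by the previous one, so the resulting codes
    are ordered coarse-to-fine."""
    residual = frame.copy()
    codes = []
    for cb in codebooks:                       # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]          # pass on what is left
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is simply the sum of the chosen codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy demo with random (untrained) codebooks -- sizes are made up.
rng = np.random.default_rng(0)
dim, n_quantizers, codebook_size = 8, 4, 256
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_quantizers)]
frame = rng.normal(size=dim)

codes = rvq_encode(frame, codebooks)           # one small int per level
recon = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(frame - recon))
```

Because each stage quantizes what the previous stage left over, decoding with only the first few codebooks still yields a rough reconstruction, which is exactly the property RVQ dropout exploits.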
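RVQ (quantizer) dropout from the Mimi bullet can then be sketched on top of the encoder above. This is a hypothetical training-time helper reusing rvq_encode; uniformly sampling the prefix length is one common choice, not necessarily Mimi's exact scheme.

```python
def rvq_encode_with_dropout(frame, codebooks, rng):
    """Quantizer dropout: keep only a random prefix of the quantizers
    at training time, so the coarse codes alone must already support a
    usable reconstruction (this is what enables variable bitrate)."""
    n_keep = rng.integers(1, len(codebooks) + 1)   # K ~ Uniform{1..N}
    return rvq_encode(frame, codebooks[:n_keep])
```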
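Finally, a sketch of one way codec tokens can enter an LLM: flatten each frame's RVQ codes into a single token stream, giving each codebook level its own id range so the model can tell a coarse (semantic) code from a fine one. This serial layout is an assumption for illustration, as is the text_vocab_size parameter; models like Moshi instead predict the levels as parallel streams with a small depth transformer.

```python
def codes_to_llm_tokens(frames_codes, codebook_size, text_vocab_size):
    """Flatten per-frame RVQ codes into one LLM token stream.
    Offsetting by level keeps a level-0 code distinct from a level-3
    code even when the raw integer index is the same."""
    tokens = []
    for codes in frames_codes:                 # codes: one int per level
        for level, c in enumerate(codes):
            tokens.append(text_vocab_size + level * codebook_size + c)
    return tokens

# Reusing `codes` and the 256-entry codebooks from the sketch above;
# 32000 is a placeholder text vocabulary size.
tokens = codes_to_llm_tokens([codes], codebook_size=256, text_vocab_size=32000)
print(tokens)
```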