
Neural audio codecs: how to get audio into LLMs

3 days ago
  • #speech-llms
  • #neural-audio-codecs
  • #audio-generation
  • Speech LLMs as of October 2025 are not as advanced as text LLMs: many still pipeline speech through transcription to text and back rather than understanding speech natively.
  • Neural audio codecs are used to compress audio into discrete tokens for LLMs, improving coherence and efficiency in audio modeling.
  • Training audio LLMs involves tokenizing audio with codecs like Mimi, which uses residual vector quantization (RVQ) and an adversarial loss for better quality (a minimal RVQ sketch follows this list).
  • Mimi, a neural audio codec developed by Kyutai, further improves quality by distilling semantic tokens into its first codebook and applying RVQ dropout during training (sketched after this list), both of which help downstream speech generation.
  • Audio LLMs still lag behind text LLMs in reasoning and understanding, with models like Moshi showing promise but relying heavily on text streams for reasoning.
  • Modern audio LLMs like CSM, Qwen3-Omni, MiMo-Audio, and LFM2-Audio build on such codecs but still fall short of fully native speech understanding and generation (one way codec tokens can feed an LLM is sketched below).
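
To make the RVQ step concrete, here is a minimal NumPy sketch of residual vector quantization. The function names, codebook sizes, and random codebooks are illustrative only, not Mimi's actual parameters; in a real codec the codebooks are learned, so each stage meaningfully shrinks the residual.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each codebook quantizes the
    residual left over by the previous one, so the resulting codes
    are ordered coarse-to-fine."""
    residual = frame.copy()
    codes = []
    for cb in codebooks:                       # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]          # pass on what is left
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is simply the sum of the chosen codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy demo with random (untrained) codebooks -- sizes are made up.
rng = np.random.default_rng(0)
dim, n_quantizers, codebook_size = 8, 4, 256
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_quantizers)]
frame = rng.normal(size=dim)

codes = rvq_encode(frame, codebooks)           # one small int per level
recon = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(frame - recon))
```

Because each stage quantizes what the previous stage left over, decoding with only the first few codebooks still yields a rough reconstruction, which is exactly the property RVQ dropout exploits.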
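RVQ (quantizer) dropout from the Mimi bullet can then be sketched on top of the encoder above. This is a hypothetical training-time helper reusing rvq_encode; uniformly sampling the prefix length is one common choice, not necessarily Mimi's exact scheme.

```python
def rvq_encode_with_dropout(frame, codebooks, rng):
    """Quantizer dropout: keep only a random prefix of the quantizers
    at training time, so the coarse codes alone must already support a
    usable reconstruction (this is what enables variable bitrate)."""
    n_keep = rng.integers(1, len(codebooks) + 1)   # K ~ Uniform{1..N}
    return rvq_encode(frame, codebooks[:n_keep])
```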
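Finally, a sketch of one way codec tokens can enter an LLM: flatten each frame's RVQ codes into a single token stream, giving each codebook level its own id range so the model can tell a coarse (semantic) code from a fine one. This serial layout is an assumption for illustration, as is the text_vocab_size parameter; models like Moshi instead predict the levels as parallel streams with a small depth transformer.

```python
def codes_to_llm_tokens(frames_codes, codebook_size, text_vocab_size):
    """Flatten per-frame RVQ codes into one LLM token stream.
    Offsetting by level keeps a level-0 code distinct from a level-3
    code even when the raw integer index is the same."""
    tokens = []
    for codes in frames_codes:                 # codes: one int per level
        for level, c in enumerate(codes):
            tokens.append(text_vocab_size + level * codebook_size + c)
    return tokens

# Reusing `codes` and the 256-entry codebooks from the sketch above;
# 32000 is a placeholder text vocabulary size.
tokens = codes_to_llm_tokens([codes], codebook_size=256, text_vocab_size=32000)
print(tokens)
```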