MisoTTS Emotive Speech Model

6 hours ago
  • #transformer-model
  • #text-to-speech
  • #emotive-AI
  • Introduces MisoTTS, an 8-billion-parameter transformer for emotive speech generation, addressing limitations of existing text-to-speech models.
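  • Uses Residual Vector Quantization (RVQ), in which stacked codebooks jointly address approximately 3.4 × 10^81 distinct audio token combinations. The addressable space grows exponentially with the number of codebooks, so no single vocabulary of that size is needed.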
  • Conditions on both text and audio context, so the generated speech is more natural, expressive, and context-aware, and can respond to the user's tone.
  • Features a split architecture: a 7.7B-parameter backbone predicts the first codebook index, while a 300M-parameter decoder predicts the remaining indices across the RVQ depth.
  • Currently models individual turns and half-duplex audio, with turn-taking and full-duplex conversation noted as future work.
  • Open-source weights are available on Hugging Face, with API access planned; licensed under a modified MIT license.
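
The RVQ point above is easier to see with a toy example. The following is a minimal sketch in plain Python, not MisoTTS's actual codec: the function names, the random untrained codebooks, and the values of `DEPTH`, `SIZE`, and `DIM` are all assumptions chosen for illustration. Each codebook quantizes the residual left by the previous one, so a frame is represented by one index per codebook.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec into one index per codebook; each level codes the leftover residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next codebook sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruct a vector by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_codebooks(depth, size, dim, seed=0):
    """Toy random codebooks whose scale shrinks with depth (hypothetical, untrained)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1, 1) / (2 ** level) for _ in range(dim)] for _ in range(size)]
        for level in range(depth)
    ]

# Toy configuration, NOT the model's real codebook count or size.
DEPTH, SIZE, DIM = 4, 16, 3
codebooks = make_codebooks(DEPTH, SIZE, DIM)

frame = [0.3, -0.7, 0.5]
codes = rvq_encode(codebooks, frame)    # one index per codebook
approx = rvq_decode(codebooks, codes)   # sum of the selected entries

# Only DEPTH * SIZE = 64 entries are stored, yet SIZE ** DEPTH index
# combinations can be addressed.
addressable = SIZE ** DEPTH
```

In terms of the split architecture described above, the first entry of `codes` corresponds to the index the backbone predicts, and the remaining entries correspond to those the smaller decoder fills in across the RVQ depth.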