MisoTTS Emotive Speech Model

Introduces MisoTTS, an 8-billion-parameter transformer for emotive speech generation, addressing limitations of existing text-to-speech models.
Uses Residual Vector Quantization (RVQ), which encodes each audio frame as a sequence of codebook indices, so the number of addressable audio codes grows exponentially with the number of codebooks (to approximately 3.4 × 10^81) without requiring a single vocabulary of that size.
Conditions on both text and audio context, so the generated speech can respond to the user's tone and sound more natural, expressive, and context-aware.
Features a split architecture: a 7.7B-parameter backbone predicts the first codebook index of each frame, while a 300M-parameter decoder predicts the remaining indices across the RVQ depth.
Currently models individual turns and half-duplex audio, with turn-taking and full-duplex conversation noted as future work.
Open-source weights are available on Hugging Face, with API access planned; licensed under a modified MIT license.
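
To make the RVQ point above concrete, here is a minimal sketch of residual vector quantization in plain Python. It is an illustration of the general technique only, not MisoTTS's code: the function names (rvq_encode, rvq_decode, addressable_codes) and the tiny one-dimensional codebooks are hypothetical, and the summary does not state the model's actual codebook size or depth, so the toy values below are assumptions chosen for readability.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(codebooks, vec):
    """Quantize vec greedily: each codebook quantizes the residual left by the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry; the next codebook refines what remains.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct a vector by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def addressable_codes(codebook_size, depth):
    """Number of distinct index tuples: codebook_size ** depth."""
    return codebook_size ** depth


# Toy setup (hypothetical values): two codebooks of four 1-D entries each.
coarse = [[0.0], [1.0], [2.0], [3.0]]
fine = [[-0.5], [-0.25], [0.0], [0.25]]
codebooks = [coarse, fine]

indices = rvq_encode(codebooks, [2.3])
print(indices)                          # [2, 3]
print(rvq_decode(codebooks, indices))   # [2.25]
print(addressable_codes(4, 2))          # 16 distinct codes from only 8 stored entries
```

The count grows as codebook_size ** depth while the number of stored entries grows only as codebook_size * depth, which is why a model can emit one small index per codebook instead of choosing from one enormous vocabulary. In the split described above, the backbone would produce the first index of each frame and the smaller decoder the remaining ones.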
