MisoTTS Emotive Speech Model
8 hours ago
- #transformer-model
- #text-to-speech
- #emotive-AI
- Introduces MisoTTS, an 8-billion-parameter transformer for emotive speech generation, addressing limitations of existing text-to-speech models.
- Uses Residual Vector Quantization (RVQ), which encodes each audio frame as a sequence of codebook indices, so the number of addressable audio codes grows exponentially with RVQ depth (to approximately 3.4 × 10^81) without requiring an equally large vocabulary.
- Conditions on both text and audio context, so the generated speech is more natural, expressive, and context-aware, and can respond to the user's tone.
- Features a split architecture: a 7.7B-parameter backbone predicts the first codebook index of each frame, while a 300M-parameter decoder predicts the remaining indices across the RVQ depth.
- Currently models individual turns and half-duplex audio, with turn-taking and full-duplex conversation noted as future work.
- Open-source weights are available on Hugging Face under a modified MIT license, with API access planned.
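
The RVQ bullet above can be illustrated with a minimal sketch. This is not MisoTTS code: the codebooks are random, and `DIM`, `SIZE`, and `DEPTH` are toy values chosen for illustration, not the model's real configuration. It shows only the general technique: each stage quantizes the residual left by the previous stage, and the number of distinct index tuples is the codebook size raised to the depth.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Return one index per codebook; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def addressable_codes(codebook_size, depth):
    """Number of distinct index tuples: exponential in depth."""
    return codebook_size ** depth

# Hypothetical toy configuration (not the MisoTTS values).
DIM, SIZE, DEPTH = 4, 16, 3
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(DIM)] for _ in range(SIZE)]
    for stage in range(DEPTH)
]

frame = [rng.gauss(0, 1) for _ in range(DIM)]
indices = rvq_encode(frame, codebooks)
print("indices per frame:", indices)
print("reconstruction:", rvq_decode(indices, codebooks))
print("addressable codes:", addressable_codes(SIZE, DEPTH))
```

With these toy values, a per-stage vocabulary of only 16 entries addresses 16^3 distinct codes; the same scaling is what lets a modest per-codebook vocabulary reach a very large combined code space.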
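
The split architecture described in the list above can be sketched as a per-frame loop. Everything here is a hypothetical stand-in: `backbone_step` and `decoder_step` return random indices rather than running real networks, and `CODEBOOK_SIZE` and `DEPTH` are illustrative values. The sketch shows only the control flow the summary describes, where the large model emits the first codebook index and the small model fills in the rest of the RVQ depth.

```python
import random
from typing import List

CODEBOOK_SIZE = 16  # hypothetical; not the MisoTTS value
DEPTH = 4           # hypothetical number of RVQ codebooks per frame

def backbone_step(context: List[List[int]], rng: random.Random) -> int:
    """Stand-in for the large backbone: picks the codebook-0 index for the next frame.

    A real model would attend over the text and audio context; this stub ignores it.
    """
    return rng.randrange(CODEBOOK_SIZE)

def decoder_step(first: int, previous: List[int], rng: random.Random) -> int:
    """Stand-in for the small decoder: picks the next index within the same frame.

    A real model would condition on the backbone output and earlier indices.
    """
    return rng.randrange(CODEBOOK_SIZE)

def generate_frame(context: List[List[int]], rng: random.Random) -> List[int]:
    """Produce one frame: the backbone gives index 0, the decoder gives indices 1..DEPTH-1."""
    frame = [backbone_step(context, rng)]
    while len(frame) < DEPTH:
        frame.append(decoder_step(frame[0], frame[1:], rng))
    return frame

def generate(num_frames: int, seed: int = 0) -> List[List[int]]:
    """Autoregressively generate frames, feeding each finished frame back as context."""
    rng = random.Random(seed)
    context: List[List[int]] = []
    for _ in range(num_frames):
        context.append(generate_frame(context, rng))
    return context

frames = generate(5)
for i, frame in enumerate(frames):
    print(f"frame {i}: {frame}")
```

In this arrangement the expensive model runs once per frame, while the cheaper model runs `DEPTH - 1` times per frame, which is one plausible motivation for pairing a 7.7B-parameter backbone with a 300M-parameter decoder.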