Microsoft open-weights VibeVoice TTS supports up to 90 minutes of speech and 4 speakers

15 days ago

VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio like podcasts.
It uses continuous speech tokenizers at 7.5 Hz for efficiency and fidelity.
The model supports up to 4 speakers and can generate speech up to 90 minutes long.
VibeVoice-7B-Preview model weights have been open-sourced.
The architecture is a next-token diffusion framework that uses an LLM for context understanding.
Examples include cross-lingual outputs, spontaneous singing, and long conversations.
Installation uses an NVIDIA Deep Learning Container, after which the provided demo scripts can be run.
The model may generate background music based on voice prompts and input text.
No text normalization is performed; the model handles complex inputs directly.
Training data lacks music, but the model can sing (though potentially off-key).
Chinese data is limited, and special characters may cause pronunciation issues.
Potential misuse includes deepfakes and disinformation; users must comply with laws and disclose AI use.
The model supports only English and Chinese; it does not handle non-speech audio or overlapping speech.
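
To give a sense of scale for the 7.5 Hz tokenizer rate, the short sketch below works out how many tokenizer frames a 90-minute generation would span. It is illustrative arithmetic only: `frames_for_duration` is a hypothetical helper, not part of VibeVoice, and it assumes one frame per tokenizer step at a constant rate.

```python
# Illustrative arithmetic only; not VibeVoice code.
FRAME_RATE_HZ = 7.5   # tokenizer frame rate stated in the summary
MAX_MINUTES = 90      # maximum generation length stated in the summary


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio.

    Assumes a constant frame rate (hypothetical helper for illustration).
    """
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    # 90 min * 60 s/min * 7.5 frames/s
    print(frames_for_duration(MAX_MINUTES))  # 40500
```

At this rate one minute of audio corresponds to 450 frames, which suggests why a low frame rate helps keep a 90-minute sequence manageable for the LLM.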
