Microsoft releases open weights for VibeVoice TTS: up to 90 minutes of speech, 4 speakers
15 days ago
- #AI Audio Generation
- #Text-to-Speech
- #Conversational AI
- VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio like podcasts.
- It uses continuous speech tokenizers running at a low frame rate of 7.5 Hz, which keeps long sequences efficient while preserving audio fidelity.
- The model supports up to 4 speakers and can generate speech up to 90 minutes long.
- VibeVoice-7B-Preview model weights have been open-sourced.
- It is built on next-token diffusion, with an LLM providing context understanding.
- Examples include cross-lingual outputs, spontaneous singing, and long conversations.
- Installation involves using an NVIDIA Deep Learning Container and running the provided demo scripts.
- The model may generate background music based on voice prompts and input text.
- No text normalization is performed; the model handles complex inputs directly.
- Training data lacks music, but the model can sing (though potentially off-key).
- Chinese data is limited, and special characters may cause pronunciation issues.
- Potential misuse includes deepfakes and disinformation; users must comply with laws and disclose AI use.
- The model supports only English and Chinese; non-speech audio and overlapping speech are not handled.
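
To put the 7.5 Hz figure in perspective, here is a minimal arithmetic sketch of roughly how many tokenizer frames a session of a given length would need, plus a pre-check against the limits listed above. The function and constant names are hypothetical and not part of VibeVoice; only the numbers (7.5 Hz, 90 minutes, 4 speakers) come from the summary.

```python
# Figures taken from the summary above; all names here are illustrative only.
FRAME_RATE_HZ = 7.5   # tokenizer frame rate
MAX_MINUTES = 90      # stated upper bound on generated speech
MAX_SPEAKERS = 4      # stated upper bound on distinct speakers


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def within_stated_limits(minutes: float, num_speakers: int) -> bool:
    """Hypothetical pre-check against the limits listed in the summary."""
    return 0 < minutes <= MAX_MINUTES and 1 <= num_speakers <= MAX_SPEAKERS


if __name__ == "__main__":
    print(frames_for_duration(1))        # 450 frames per minute of audio
    print(frames_for_duration(90))       # 40500 frames for a 90-minute session
    print(within_stated_limits(90, 4))   # True
    print(within_stated_limits(30, 5))   # False: more than 4 speakers
```

At 450 frames per minute, even a full 90-minute session stays in the tens of thousands of frames, which is presumably part of what makes generation at that length practical.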