Hasty Briefs (beta)

Microsoft's open-weights VibeVoice TTS supports 90 minutes of speech, 4 speakers

15 days ago
  • #AI Audio Generation
  • #Text-to-Speech
  • #Conversational AI
  • VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio like podcasts.
  • It uses continuous speech tokenizers operating at a low frame rate of 7.5 Hz, which keeps long sequences efficient to process while preserving audio fidelity.
  • The model supports up to 4 speakers and can generate speech up to 90 minutes long.
  • VibeVoice-7B-Preview model weights have been open-sourced.
  • It is built on a next-token diffusion approach, with an LLM providing context understanding.
  • Examples include cross-lingual outputs, spontaneous singing, and long conversations.
  • Installation uses an NVIDIA Deep Learning Container, after which the provided demo scripts can be run.
  • The model may generate background music based on voice prompts and input text.
  • No text normalization is performed; the model handles complex inputs directly.
  • Training data lacks music, but the model can sing (though potentially off-key).
  • Chinese data is limited, and special characters may cause pronunciation issues.
  • Potential misuse includes deepfakes and disinformation; users must comply with laws and disclose AI use.
  • The model supports only English and Chinese; non-speech audio and overlapping speech are not handled.
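
To give a sense of what the 7.5 Hz tokenizer rate means for a 90-minute generation, here is a minimal arithmetic sketch. It uses only the two figures stated in the points above (7.5 Hz and 90 minutes); the function names are illustrative and not part of VibeVoice, and the count assumes one frame per tokenizer step, which is a simplification.

```python
# Back-of-the-envelope arithmetic, not VibeVoice code.
# Assumption: one tokenizer frame per step at the stated frame rate.

FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in the summary


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Length of audio, in milliseconds, covered by a single frame."""
    return 1000.0 / frame_rate_hz


if __name__ == "__main__":
    print(frames_for_duration(90))        # 40500
    print(round(frame_duration_ms(), 1))  # 133.3
```

Under these assumptions, 90 minutes of audio corresponds to about 40,500 frames, each covering roughly 133 ms, which suggests why a low frame rate helps keep long-form generation within a manageable sequence length.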