Microsoft VibeVoice: A Frontier Open-Source Text-to-Speech Model
7 days ago
- #AI Audio Generation
- #Text-to-Speech
- #Conversational AI
- VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio like podcasts.
- It addresses long-standing challenges in TTS systems, including scalability, speaker consistency, and natural turn-taking.
- It uses continuous speech tokenizers (acoustic and semantic) operating at a frame rate of 7.5 Hz, which preserves audio fidelity while keeping long sequences computationally efficient.
- It employs a next-token diffusion framework in which an LLM handles text understanding and a diffusion head generates the acoustic details.
- It can synthesize speech up to 90 minutes long with up to 4 distinct speakers.
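
To make the efficiency claim concrete, the short sketch below works out how many tokenizer frames a 90-minute session needs at 7.5 Hz. The 7.5 Hz rate and the 90-minute limit come from the summary above. The 50 Hz comparison rate is an assumption chosen purely for illustration, not a figure from the VibeVoice material, and the function names are hypothetical.

```python
FRAME_RATE_HZ = 7.5   # tokenizer frame rate stated in the summary
MAX_MINUTES = 90      # maximum synthesis length stated in the summary
COMPARISON_HZ = 50.0  # assumed higher frame rate, for illustration only


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Return how many milliseconds of audio a single frame spans."""
    return 1000.0 / frame_rate_hz


if __name__ == "__main__":
    low = frames_for_duration(MAX_MINUTES)
    high = frames_for_duration(MAX_MINUTES, COMPARISON_HZ)
    print(f"Frames for {MAX_MINUTES} min at {FRAME_RATE_HZ} Hz: {low}")
    print(f"Frames for {MAX_MINUTES} min at {COMPARISON_HZ} Hz: {high}")
    print(f"Audio covered by one 7.5 Hz frame: {frame_duration_ms():.1f} ms")
    print(f"Sequence-length ratio: {high / low:.2f}x")
```

At 7.5 Hz, a full 90-minute session corresponds to 40,500 frames per tokenizer stream, each covering roughly 133 ms of audio. This is the arithmetic behind the claim that a low frame rate keeps long-form generation tractable for an LLM-based model.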