Microsoft VibeVoice: A Frontier Open-Source Text-to-Speech Model

7 days ago

VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio like podcasts.
It addresses key challenges in TTS systems, including scalability, speaker consistency, and natural turn-taking.
It uses continuous speech tokenizers (acoustic and semantic) that operate at a frame rate of 7.5 Hz, preserving audio fidelity while keeping computation efficient.
It employs a next-token diffusion framework, in which an LLM handles text understanding and a diffusion head generates the acoustic details.
It can synthesize speech up to 90 minutes long with up to 4 distinct speakers.
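
To give a sense of why the low 7.5 Hz frame rate matters for long-form output, the sketch below works out how many tokenizer frames a 90-minute session would need. The function names are hypothetical, and the 24 kHz sample rate in the second call is an illustrative assumption that is not stated in the summary above; this is back-of-the-envelope arithmetic, not part of VibeVoice's own code.

```python
def frames_for_duration(minutes: float, frame_rate_hz: float = 7.5) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def samples_per_frame(sample_rate_hz: int, frame_rate_hz: float = 7.5) -> float:
    """How many raw audio samples a single frame spans (the compression ratio)."""
    return sample_rate_hz / frame_rate_hz


if __name__ == "__main__":
    # 90 minutes * 60 s/min * 7.5 frames/s
    print(frames_for_duration(90))    # 40500
    # Assumed 24 kHz audio; the real sample rate may differ.
    print(samples_per_frame(24_000))  # 3200.0
```

At 7.5 frames per second, even a 90-minute conversation comes to roughly 40,500 frames, a sequence length an LLM backbone can plausibly handle in one pass.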
