Hasty Briefs (beta)

Microsoft VibeVoice: A Frontier Open-Source Text-to-Speech Model

7 days ago
  • #AI Audio Generation
  • #Text-to-Speech
  • #Conversational AI
  • VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio like podcasts.
  • It addresses long-standing challenges in TTS systems, including scalability, speaker consistency, and natural turn-taking.
  • It uses two continuous speech tokenizers (acoustic and semantic) that run at a frame rate of 7.5 Hz, preserving audio fidelity while keeping long sequences computationally efficient.
  • It employs a next-token diffusion framework: an LLM handles text understanding, and a diffusion head generates the acoustic details.
  • Can synthesize speech up to 90 minutes long with up to 4 distinct speakers.
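
To make the efficiency claim concrete, the frame-rate and duration figures above can be turned into a rough sequence-length estimate. This is a back-of-the-envelope sketch, not code from VibeVoice: the helper `frames_for` is hypothetical, and the 50 Hz comparison rate is an assumption chosen only for illustration, not a figure from the summary or a claim about any particular system. It counts frames for a single tokenizer stream and ignores text tokens.

```python
# Rough sequence-length arithmetic for a low-frame-rate speech tokenizer.
# FRAME_RATE_HZ and MAX_MINUTES come from the summary above;
# BASELINE_RATE_HZ is an ASSUMED comparison value for illustration only.
FRAME_RATE_HZ = 7.5
MAX_MINUTES = 90
BASELINE_RATE_HZ = 50.0


def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Return the number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


low_rate_frames = frames_for(MAX_MINUTES, FRAME_RATE_HZ)
baseline_frames = frames_for(MAX_MINUTES, BASELINE_RATE_HZ)

print(low_rate_frames)   # 40500
print(baseline_frames)   # 270000
print(f"{baseline_frames / low_rate_frames:.2f}x fewer frames at 7.5 Hz")
```

At 7.5 Hz, a 90-minute recording comes to 40,500 frames per stream, which suggests why a low frame rate matters when a language model has to attend over the whole conversation.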