Hasty Briefs (beta)

TADA: Fast, Reliable Speech Generation Through Text-Acoustic Synchronization

3 days ago
  • #AI
  • #Text-to-Speech
  • #Voice Technology
  • TADA (Text-Acoustic Dual Alignment) introduces a novel tokenization scheme that synchronizes text and speech tokens one-to-one, resolving the length mismatch between text and audio in LLM-based TTS systems.
  • TADA is the fastest LLM-based TTS system, offering competitive voice quality, virtually zero content hallucinations, and a lightweight footprint for on-device deployment.
  • The approach aligns audio representations directly to text tokens, creating a synchronized stream where text and speech move in lockstep, improving speed and reliability.
  • TADA generates speech at a real-time factor (RTF) of 0.09, more than 5x faster than comparable systems, with zero content hallucinations observed in the team's tests.
  • Human evaluation scores TADA high on speaker similarity (4.18/5.0) and naturalness (3.78/5.0), making it suitable for expressive, long-form speech.
  • Potential applications include on-device deployment, long-form and conversational speech, and production reliability in regulated environments.
  • Limitations include occasional speaker drift in long generations and a modality gap when generating text alongside speech, with ongoing work to address these.
  • Hume AI is open-sourcing TADA, releasing 1B and 3B parameter models, and inviting researchers to build on this work for new applications and improvements.
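The one-to-one alignment and the RTF figure above can be sketched in a few lines. This is a minimal illustration with hypothetical token names, not Hume AI's actual TADA implementation: it interleaves one acoustic token per text token so the decoded stream advances text and speech in lockstep, and computes the real-time factor as generation time divided by audio duration (RTF < 1 means faster than real time).

```python
# Illustrative sketch of one-to-one text-acoustic alignment.
# Token names are hypothetical; this is not Hume AI's implementation.

def interleave(text_tokens, acoustic_tokens):
    """Pair each text token with exactly one acoustic token, producing a
    synchronized stream in which text and speech move in lockstep."""
    if len(text_tokens) != len(acoustic_tokens):
        raise ValueError("one-to-one alignment requires equal lengths")
    stream = []
    for t, a in zip(text_tokens, acoustic_tokens):
        stream.append(t)
        stream.append(a)
    return stream

def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time to generate / duration of generated audio.
    TADA reports an RTF of 0.09, i.e. ~11x faster than real time."""
    return generation_seconds / audio_seconds

stream = interleave(["hel", "lo"], ["<a17>", "<a42>"])
print(stream)  # ['hel', '<a17>', 'lo', '<a42>']
print(real_time_factor(0.9, 10.0))
```

Because each text token carries exactly one acoustic token, the model cannot skip or repeat words without the mismatch being immediately visible in the stream, which is the intuition behind the near-zero hallucination rate.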