TADA: Fast, Reliable Speech Generation Through Text-Acoustic Synchronization
- #AI
- #Text-to-Speech
- #Voice Technology
- TADA (Text-Acoustic Dual Alignment) introduces a novel tokenization schema that synchronizes text and speech one-to-one, resolving the sequence-length mismatch between text and audio tokens in LLM-based TTS systems.
- TADA is the fastest LLM-based TTS system, offering competitive voice quality, virtually zero content hallucinations, and a lightweight footprint for on-device deployment.
- The approach aligns audio representations directly to text tokens, creating a synchronized stream where text and speech move in lockstep, improving speed and reliability.
- TADA generates speech at a real-time factor (RTF) of 0.09, more than 5x faster than comparable systems, with zero content hallucinations in the reported tests.
- Human evaluation scores TADA high on speaker similarity (4.18/5.0) and naturalness (3.78/5.0), making it suitable for expressive, long-form speech.
- Potential applications include on-device deployment, long-form and conversational speech, and production reliability in regulated environments.
- Limitations include occasional speaker drift in long generations and a modality gap when generating text alongside speech, with ongoing work to address these.
- Hume AI is open-sourcing TADA, releasing 1B and 3B parameter models, and inviting researchers to build on this work for new applications and improvements.
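The "synchronized stream" idea above can be sketched in a few lines. This is a hypothetical illustration only: the function name `synchronize`, the token values, and the fixed ratio of acoustic tokens per text token are assumptions for clarity, not TADA's actual schema.

```python
from typing import List, Tuple

# Assumed fixed ratio for illustration; TADA's real alignment may differ.
ACOUSTIC_TOKENS_PER_TEXT_TOKEN = 2

def synchronize(text_tokens: List[str],
                acoustic_tokens: List[int]) -> List[Tuple[str, List[int]]]:
    """Pair each text token with its aligned block of acoustic tokens,
    so the two modalities advance in lockstep."""
    k = ACOUSTIC_TOKENS_PER_TEXT_TOKEN
    assert len(acoustic_tokens) == k * len(text_tokens), "streams must align 1:1"
    return [(tok, acoustic_tokens[i * k:(i + 1) * k])
            for i, tok in enumerate(text_tokens)]

stream = synchronize(["hel", "lo"], [101, 102, 103, 104])
print(stream)  # [('hel', [101, 102]), ('lo', [103, 104])]
```

Because each text token carries its own fixed-size slice of acoustic tokens, the decoder never has to guess how far the audio has drifted from the text, which is the reliability property the post attributes to the design.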
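For readers unfamiliar with the RTF metric cited above, it is simply generation time divided by the duration of the audio produced; values below 1.0 mean faster than real time. A minimal sketch (the function name is our own, and the 60-second clip is a made-up example consistent with the reported RTF of 0.09):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio produced.
    RTF < 1.0 means the system generates faster than real time."""
    return generation_seconds / audio_seconds

# At an RTF of 0.09, a 60-second clip takes about 5.4 s to generate.
print(real_time_factor(5.4, 60.0))  # → 0.09
```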