Hasty Briefs (beta)

TADA: Fast, Reliable Speech Generation Through Text-Acoustic Synchronization

3 days ago
  • #AI
  • #Text-to-Speech
  • #Voice Technology
  • TADA (Text-Acoustic Dual Alignment) introduces a novel tokenization scheme that synchronizes text and speech tokens one-to-one, resolving the length mismatch between text and audio in LLM-based TTS systems.
  • TADA is the fastest LLM-based TTS system, offering competitive voice quality, virtually zero content hallucinations, and a lightweight footprint for on-device deployment.
  • The approach aligns audio representations directly to text tokens, creating a synchronized stream where text and speech move in lockstep, improving speed and reliability.
  • TADA generates speech at a real-time factor (RTF) of 0.09, more than 5x faster than comparable systems, with zero content hallucinations observed in the team's tests.
  • Human evaluation scores TADA high on speaker similarity (4.18/5.0) and naturalness (3.78/5.0), making it suitable for expressive, long-form speech.
  • Potential applications include on-device deployment, long-form and conversational speech, and production reliability in regulated environments.
  • Limitations include occasional speaker drift in long generations and a modality gap when generating text alongside speech, with ongoing work to address these.
  • Hume AI is open-sourcing TADA, releasing 1B and 3B parameter models, and inviting researchers to build on this work for new applications and improvements.
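The one-to-one alignment and the RTF figure above can be sketched in a few lines. This is a minimal illustration with hypothetical token names, not Hume AI's actual TADA implementation: it interleaves one acoustic token per text token so the decoded stream advances text and speech in lockstep, and computes the real-time factor as generation time divided by audio duration (RTF < 1 means faster than real time).

```python
# Illustrative sketch of one-to-one text-acoustic alignment.
# Token names are hypothetical; this is not Hume AI's implementation.

def interleave(text_tokens, acoustic_tokens):
    """Pair each text token with exactly one acoustic token, producing a
    synchronized stream in which text and speech move in lockstep."""
    if len(text_tokens) != len(acoustic_tokens):
        raise ValueError("one-to-one alignment requires equal lengths")
    stream = []
    for t, a in zip(text_tokens, acoustic_tokens):
        stream.append(t)
        stream.append(a)
    return stream

def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time to generate / duration of generated audio.
    TADA reports an RTF of 0.09, i.e. ~11x faster than real time."""
    return generation_seconds / audio_seconds

stream = interleave(["hel", "lo"], ["<a17>", "<a42>"])
print(stream)  # ['hel', '<a17>', 'lo', '<a42>']
print(real_time_factor(0.9, 10.0))
```

Because each text token carries exactly one acoustic token, the model cannot skip or repeat words without the mismatch being immediately visible in the stream, which is the intuition behind the near-zero hallucination rate.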