Llasa: Llama-Based Speech Synthesis
- #scaling
- #LLMs
- #speech-synthesis
- Explores scaling of train-time and inference-time compute for speech synthesis.
- Proposes Llasa, a simple framework pairing a single-layer vector-quantized (VQ) speech codec with a single Transformer architecture aligned with standard LLMs such as Llama.
- Shows that scaling train-time compute improves the naturalness and prosody of synthesized speech.
- Demonstrates that scaling inference-time compute, guided by verifiers during search, enhances emotional expressiveness, timbre consistency, and content accuracy.
- Publicly releases checkpoints and training code for the TTS models (1B, 3B, 8B) and the codec model.
- Compares inference-time scaling results across evaluation metrics and benchmarks such as RAVDESS.
- Evaluates models across various sizes and training data amounts, highlighting text comprehension and synthesis quality.
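The inference-time scaling idea above boils down to sampling multiple candidate speech-token sequences and letting a verifier pick the best one. A minimal sketch of that best-of-N pattern, with toy stand-ins for the model and verifier (the names `generate_speech_tokens` and `verifier_score` are illustrative, not Llasa's actual API):

```python
import random

def generate_speech_tokens(text, rng):
    """Stand-in for autoregressive sampling of codec tokens from the LM.
    Toy: emits random ids from a single codebook (Llasa uses a single-layer
    VQ codec, so one token stream suffices)."""
    return [rng.randrange(1024) for _ in range(len(text) * 4)]

def verifier_score(text, tokens):
    """Stand-in for a verifier (e.g. an ASR-based content check or an
    emotion/timbre classifier). Toy heuristic: prefer candidates whose
    mean token id is close to the codebook center."""
    mean = sum(tokens) / len(tokens)
    return -abs(mean - 512)

def best_of_n(text, n=8, seed=0):
    """Spend more inference compute (larger n) to get a better sample."""
    rng = random.Random(seed)
    candidates = [generate_speech_tokens(text, rng) for _ in range(n)]
    return max(candidates, key=lambda t: verifier_score(text, t))

tokens = best_of_n("Hello world", n=8)
```

Raising `n` trades compute for quality: each extra candidate is another chance for the verifier to find a sequence with better content accuracy or expressiveness.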