Llasa: Llama-Based Speech Synthesis
- #scaling
- #LLMs
- #speech-synthesis
- Explores scaling of train-time and inference-time compute for speech synthesis.
- Proposes Llasa, a simple framework pairing a single-layer vector-quantized (VQ) speech codec with a single Transformer architecture aligned with standard LLMs such as Llama.
- Shows that scaling train-time compute improves the naturalness and prosody of synthesized speech.
- Demonstrates that scaling inference-time compute, guided by verifiers during search, enhances emotional expressiveness, timbre consistency, and content accuracy.
- Publicly releases checkpoints and training code for the TTS models (1B, 3B, 8B) and the codec model.
- Compares inference-time scaling results across evaluation metrics and benchmarks such as RAVDESS.
- Evaluates models across various sizes and training data amounts, highlighting text comprehension and synthesis quality.
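The inference-time scaling idea above boils down to sampling multiple candidate speech-token sequences and letting a verifier pick the best one. A minimal sketch of that best-of-N pattern, with toy stand-ins for the model and verifier (the names `generate_speech_tokens` and `verifier_score` are illustrative, not Llasa's actual API):

```python
import random

def generate_speech_tokens(text, rng):
    """Stand-in for autoregressive sampling of codec tokens from the LM.
    Toy: emits random ids from a single codebook (Llasa uses a single-layer
    VQ codec, so one token stream suffices)."""
    return [rng.randrange(1024) for _ in range(len(text) * 4)]

def verifier_score(text, tokens):
    """Stand-in for a verifier (e.g. an ASR-based content check or an
    emotion/timbre classifier). Toy heuristic: prefer candidates whose
    mean token id is close to the codebook center."""
    mean = sum(tokens) / len(tokens)
    return -abs(mean - 512)

def best_of_n(text, n=8, seed=0):
    """Spend more inference compute (larger n) to get a better sample."""
    rng = random.Random(seed)
    candidates = [generate_speech_tokens(text, rng) for _ in range(n)]
    return max(candidates, key=lambda t: verifier_score(text, t))

tokens = best_of_n("Hello world", n=8)
```

Raising `n` trades compute for quality: each extra candidate is another chance for the verifier to find a sequence with better content accuracy or expressiveness.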