Llasa: Llama-Based Speech Synthesis

  • #scaling
  • #LLMs
  • #speech-synthesis
  • Explores scaling of train-time and inference-time compute for speech synthesis.
  • Proposes Llasa, a simple framework that pairs a single-layer vector-quantized (VQ) speech codec with a single Transformer, fully aligned with standard LLMs such as LLaMA (see the modeling sketch after this list).
  • Shows that scaling train-time compute improves speech naturalness and prosody patterns.
  • Demonstrates that scaling inference-time compute, guided by verifiers during search, enhances emotional expressiveness, timbre consistency, and content accuracy (a best-of-N sketch follows the list).
  • Publicly releases the checkpoints and training code for its TTS models (1B, 3B, 8B) and its codec model.
  • Compares inference-time scaling results across different evaluation metrics and benchmarks such as RAVDESS.
  • Evaluates models across a range of sizes and training-data amounts, highlighting gains in text comprehension and synthesis quality.
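
The single-sequence formulation behind Llasa reduces TTS to ordinary next-token prediction: speech is discretized by the single-layer VQ codec, and its codes share one vocabulary with the text tokens of a decoder-only Transformer. Below is a minimal sketch of that idea; the vocabulary sizes, token offsets, and module choices are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000    # LLaMA-style text tokenizer size (placeholder)
SPEECH_VOCAB = 65_536  # single-layer VQ codebook size (placeholder)
VOCAB = TEXT_VOCAB + SPEECH_VOCAB


class SingleStreamTTS(nn.Module):
    """One decoder-only Transformer over [text tokens ; speech tokens]."""

    def __init__(self, d_model=1024, n_layers=12, n_heads=16):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        # Causal mask: plain next-token prediction over the joint sequence,
        # so every speech token conditions on the full text prefix.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.embed(ids), mask=mask)
        return self.lm_head(h)


def to_joint_ids(text_ids, speech_codes):
    # Offset speech codes into the shared vocabulary so a single softmax
    # covers both modalities.
    return torch.cat([text_ids, speech_codes + TEXT_VOCAB], dim=1)
```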
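
Verifier-driven inference-time scaling, as in the fourth bullet, typically means sampling several candidates and letting an external judge choose among them. Here is a minimal best-of-N sketch under assumed interfaces: `synthesize` stands in for a stochastic Llasa-style TTS call and `verifier_score` for a judge such as an ASR-based content checker or a speaker/emotion classifier; neither is an API from the released code.

```python
from typing import Any, Callable


def best_of_n(
    text: str,
    synthesize: Callable[[str], Any],             # stochastic TTS: text -> waveform
    verifier_score: Callable[[str, Any], float],  # judge: higher is better
    n: int = 16,
) -> Any:
    """Sample n candidates and keep the one the verifier prefers."""
    best_wav, best_score = None, float("-inf")
    for _ in range(n):
        wav = synthesize(text)
        score = verifier_score(text, wav)
        if score > best_score:
            best_wav, best_score = wav, score
    return best_wav
```

Best-of-N is the simplest verifier-guided search; finer-grained variants score partial token sequences during decoding rather than only finished utterances.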