Hasty Briefs (beta)

Kyutai 1.6B Streaming TTS

10 months ago
  • #AI
  • #text-to-speech
  • #streaming
  • Kyutai TTS is a streaming text-to-speech model that starts producing audio as soon as it receives the first few words of text.
  • The model uses a hierarchical Transformer architecture with a 1B-parameter backbone and a 600M-parameter depth transformer.
  • It supports English and French and operates at a frame rate of 12.5 Hz, with 32 audio tokens per frame.
  • Voice conditioning is possible through pre-computed embeddings, but the model does not support classifier-free guidance (CFG) directly.
  • Pretraining ran for 750k steps with a batch size of 64 on 32 Nvidia H100 GPUs.
  • The model is licensed under CC-BY 4.0 and does not watermark its output, because watermarking is considered ineffective for open-source models.
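
The figures in the summary above imply a few derived quantities that are easy to check by hand. The following is a minimal sketch in Python, not code from the model itself: the constants are copied from the bullets, the function names are hypothetical, and the training-sequence count assumes that "batch size of 64" means 64 sequences per step, which the summary does not state.

```python
import math

# Figures copied from the summary; treated as nominal, rounded values.
FRAME_RATE_HZ = 12.5            # audio frames per second
TOKENS_PER_FRAME = 32           # audio tokens per frame
BACKBONE_PARAMS = 1_000_000_000
DEPTH_PARAMS = 600_000_000
TRAIN_STEPS = 750_000
BATCH_SIZE = 64                 # assumed to count sequences per step


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Length of one audio frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def tokens_per_second(frame_rate_hz: float = FRAME_RATE_HZ,
                      tokens_per_frame: int = TOKENS_PER_FRAME) -> float:
    """Audio tokens needed for one second of speech."""
    return frame_rate_hz * tokens_per_frame


def audio_tokens_for(seconds: float) -> int:
    """Audio tokens needed for a clip, rounding up to whole frames."""
    frames = math.ceil(seconds * FRAME_RATE_HZ)
    return frames * TOKENS_PER_FRAME


if __name__ == "__main__":
    print(f"frame duration: {frame_duration_ms():.0f} ms")
    print(f"audio tokens per second: {tokens_per_second():.0f}")
    print(f"audio tokens for 10 s: {audio_tokens_for(10)}")
    print(f"total parameters: {(BACKBONE_PARAMS + DEPTH_PARAMS) / 1e9:.1f}B")
    print(f"sequences seen in pretraining: {TRAIN_STEPS * BATCH_SIZE:,}")
```

Under these assumptions each frame covers 80 ms of audio, one second of speech corresponds to 400 audio tokens, and the two components add up to the 1.6B in the model's name.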