Kyutai 1.6B Streaming TTS

10 months ago

Kyutai TTS is a streaming text-to-speech model that starts outputting audio as soon as it has received the first few words of text.
The model uses a hierarchical Transformer architecture with a 1B-parameter backbone and a 600M-parameter depth transformer, which together account for the 1.6B in its name.
It supports English and French and operates at a frame rate of 12.5 Hz, with each frame represented by 32 audio tokens.
Voice conditioning is possible through pre-computed embeddings, but the model does not support classifier-free guidance (CFG) directly.
Training ran for 750k steps with a batch size of 64, and pretraining used 32 Nvidia H100 GPUs.
The model is licensed under CC-BY 4.0 and does not watermark its output, on the grounds that watermarking is ineffective for open-source models.
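
The frame rate and tokens-per-frame figures above imply a fixed audio token budget, which is easy to work out by hand. The short Python sketch below does that arithmetic; the constant and function names are illustrative only and are not part of any Kyutai library.

```python
import math

# Figures quoted in the summary above.
FRAME_RATE_HZ = 12.5      # audio frames per second
TOKENS_PER_FRAME = 32     # audio tokens per frame


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Length of one audio frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def audio_tokens_per_second(
    frame_rate_hz: float = FRAME_RATE_HZ,
    tokens_per_frame: int = TOKENS_PER_FRAME,
) -> float:
    """Number of audio tokens needed for one second of speech."""
    return frame_rate_hz * tokens_per_frame


def audio_tokens_for(
    seconds: float,
    frame_rate_hz: float = FRAME_RATE_HZ,
    tokens_per_frame: int = TOKENS_PER_FRAME,
) -> int:
    """Audio tokens needed to cover `seconds` of speech.

    Assumption: a partially filled final frame still costs a whole frame,
    so the frame count is rounded up.
    """
    frames = math.ceil(seconds * frame_rate_hz)
    return frames * tokens_per_frame


if __name__ == "__main__":
    print(f"frame duration: {frame_duration_ms()} ms")
    print(f"tokens per second: {audio_tokens_per_second()}")
    print(f"tokens for 10 s of audio: {audio_tokens_for(10)}")
```

At 12.5 Hz each frame covers 80 ms of audio, so one second of speech corresponds to 400 audio tokens.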
