- Kyutai TTS is a streaming text-to-speech model: it starts producing audio as soon as the first few words of text have been received.
- The model uses a hierarchical Transformer architecture with a 1B-parameter backbone and a 600M-parameter depth Transformer.
- It supports English and French and operates at a frame rate of 12.5 Hz, with each audio frame represented by 32 audio tokens.
- Voice conditioning is done through pre-computed embeddings. The model does not support Classifier-Free Guidance (CFG) directly.
- The model was trained for 750k steps with a batch size of 64; pretraining used 32 Nvidia H100 GPUs.
- The model is licensed under CC-BY 4.0. It does not watermark its output, on the grounds that watermarking is ineffective for open-source models.
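
The figures above imply a few derived quantities that are easy to check by hand. The following is a minimal sketch of that arithmetic only; it does not call any Kyutai API, and the constant and function names are hypothetical, chosen for illustration. The combined parameter count simply adds the two stated components and ignores anything else the model may contain.

```python
import math

# Figures taken from the summary above.
FRAME_RATE_HZ = 12.5          # audio frames per second
TOKENS_PER_FRAME = 32         # audio tokens per frame
BACKBONE_PARAMS = 1.0e9       # backbone Transformer
DEPTH_PARAMS = 0.6e9          # depth Transformer
TRAIN_STEPS = 750_000
BATCH_SIZE = 64


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of one audio frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def tokens_per_second(frame_rate_hz: float = FRAME_RATE_HZ,
                      tokens_per_frame: int = TOKENS_PER_FRAME) -> float:
    """Audio tokens needed to represent one second of audio."""
    return frame_rate_hz * tokens_per_frame


def frames_for(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(f"frame duration: {frame_duration_ms():.0f} ms")
    print(f"token rate:     {tokens_per_second():.0f} tokens/s")
    print(f"10 s of audio:  {frames_for(10)} frames, "
          f"{frames_for(10) * TOKENS_PER_FRAME} tokens")
    print(f"stated params:  {(BACKBONE_PARAMS + DEPTH_PARAMS) / 1e9:.1f}B")
    print(f"sequences seen: {TRAIN_STEPS * BATCH_SIZE:,}")
```

At 12.5 Hz each frame covers 80 ms, so 32 tokens per frame works out to 400 audio tokens per second of speech, and 750k steps at a batch size of 64 corresponds to 48 million training sequences.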