High-Fidelity Simultaneous Speech-to-Speech Translation
10 months ago
- #natural language processing
- #machine learning
- #speech translation
- Hibiki is a decoder-only model for simultaneous speech translation.
- It uses a multistream language model to process source and target speech synchronously.
- The model jointly produces text and audio tokens for speech-to-text and speech-to-speech translation.
- A weakly-supervised method leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and to create aligned synthetic training data.
- Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling.
- On a French-English simultaneous speech translation task, it achieves state-of-the-art performance in translation quality, speaker fidelity, and naturalness.
- The model is compatible with batched translation and real-time on-device deployment.
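
The "vanilla temperature sampling" mentioned above can be illustrated with a minimal sketch. This is not Hibiki's actual decoding code: the function name `sample_with_temperature`, the default temperature of 0.8, and the toy logits are assumptions made for illustration. The idea is only that each next token is drawn from a softmax over the model's logits divided by a temperature, with no beam search or separate decision policy on top.

```python
import math
import random


def sample_with_temperature(logits, temperature=0.8, rng=None):
    """Draw one token index from softmax(logits / temperature).

    Hypothetical helper for illustration; the name and the default
    temperature are assumptions, not values taken from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Scale the logits: temperature < 1 sharpens the distribution,
    # temperature > 1 flattens it.
    scaled = [logit / temperature for logit in logits]

    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Sample a single index according to the resulting probabilities.
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


# Toy logits over a 3-token vocabulary (made-up numbers).
toy_logits = [1.0, 5.0, 2.0]
token, distribution = sample_with_temperature(
    toy_logits, temperature=0.8, rng=random.Random(0)
)
```

In a multistream setting this draw would be repeated at every time step for the text and audio token streams; because each step is a single categorical sample, the same routine applies unchanged across a batch, which is consistent with the summary's point about batched and on-device inference.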