High-Fidelity Simultaneous Speech-to-Speech Translation

a year ago

Hibiki is a decoder-only model for simultaneous speech translation.
It uses a multistream language model to process source and target speech synchronously.
The model jointly produces text and audio tokens for speech-to-text and speech-to-speech translation.
A weakly supervised method uses the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and to build aligned synthetic training data.
Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling.
On a French-English simultaneous speech translation task, it achieves state-of-the-art performance in translation quality, speaker fidelity, and naturalness.
The model is compatible with batched translation and real-time on-device deployment.
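
The delay-finding idea can be illustrated with a small sketch. It assumes one plausible reading of the approach: each target word is aligned to the source word whose arrival most increases that target word's log-likelihood under a text translation model (log-likelihood being the quantity perplexity is computed from). The names `align_by_likelihood_gain`, `toy_log_prob`, and `TOY_LEXICON` are hypothetical, and the toy scorer only stands in for a real translation system; this is not the authors' implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer signature (an assumption for this sketch):
# log p(target_word | source_prefix, target_prefix).
Scorer = Callable[[Sequence[str], Sequence[str], str], float]


def align_by_likelihood_gain(
    source: Sequence[str], target: Sequence[str], log_prob: Scorer
) -> List[int]:
    """For each target word, return the index of the source word whose
    arrival gives the largest gain in that target word's log-likelihood."""
    alignment = []
    for i, word in enumerate(target):
        # Score with an empty source prefix as the starting point.
        prev = log_prob(source[:0], target[:i], word)
        best_j, best_gain = 0, float("-inf")
        for j in range(len(source)):
            cur = log_prob(source[: j + 1], target[:i], word)
            gain = cur - prev
            if gain > best_gain:
                best_j, best_gain = j, gain
            prev = cur
        alignment.append(best_j)
    return alignment


# Toy stand-in for a translation model: a word becomes likely once its
# source counterpart has been seen. Purely illustrative.
TOY_LEXICON = {"le": "the", "chat": "cat", "dort": "sleeps"}


def toy_log_prob(source_prefix, target_prefix, word):
    seen = {TOY_LEXICON.get(s) for s in source_prefix}
    return -0.1 if word in seen else -5.0


if __name__ == "__main__":
    src = ["le", "chat", "dort"]
    tgt = ["the", "cat", "sleeps"]
    print(align_by_likelihood_gain(src, tgt, toy_log_prob))  # → [0, 1, 2]
```

In this reading, a target word should not be emitted before its aligned source word has been heard, which is what turns the alignment into a per-word delay.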
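
For readers unfamiliar with the term, vanilla temperature sampling can be sketched as follows: logits are divided by a temperature, passed through a softmax, and one token index is drawn from the resulting distribution. The function name `temperature_sample` and the example logits are assumptions made for illustration; the sketch shows the general technique, not Hibiki's actual inference code.

```python
import math
import random
from typing import Optional, Sequence


def temperature_sample(
    logits: Sequence[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Draw one index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.0, 5.0, 1.0]  # illustrative values only
    # A low temperature concentrates mass on the largest logit;
    # a high temperature flattens the distribution.
    print(temperature_sample(logits, temperature=0.8, rng=random.Random(0)))
```

Because no search or learned policy is involved at decoding time, each step reduces to one forward pass and one draw, which is consistent with the claim that inference is simple enough to batch.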
