Exploring JEPA for real-time speech translation
- #self-supervised-learning
- #audio-encoder
- #speech-translation
- JEPA-v0 is a self-supervised audio encoder designed for real-time speech-to-speech translation, aiming to preserve voice, emotion, and timing.
- Traditional translation models lose paralinguistic features (e.g., pitch, rhythm) by converting speech to text and back, while JEPA-v0 works directly with audio representations.
- Self-supervised learning allows JEPA-v0 to train on diverse audio data (speech, music, environmental sounds) without labeled datasets, unlike supervised models like Whisper.
- JEPA avoids representational collapse in self-supervised learning through mechanisms like a stop-gradient on the target encoder, an exponential-moving-average (EMA) update of the target encoder's weights, and a predictor bottleneck on the online branch.
- JEPA-v0 scores well on acoustic tasks like spoofing detection (0.927) but fails entirely at linguistic tasks like speech recognition (0.000) and is middling on general captioning (0.478).
- Visualizations of embeddings show JEPA-v0 captures broad acoustic patterns (e.g., emotion in CREMA-D, music texture in GTZAN) but lacks phonemic resolution.
- Future improvements include increasing temporal resolution, preserving frequency structure, and integrating with translation decoders to retain speaker characteristics.
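The anti-collapse recipe mentioned above (stop-gradient target, EMA weight tracking, predictor bottleneck) can be sketched in a few lines. This is a hypothetical illustration, not JEPA-v0's actual code; the momentum value, weight shapes, and function name are assumptions.

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.996):
    """Move the target encoder's weights toward the online encoder's
    via an exponential moving average. The target is never updated by
    backprop (stop-gradient): gradients flow only through the online
    encoder and the predictor."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

# Toy weights: one matrix per "encoder" (illustrative shapes).
online = [np.ones((2, 2))]    # pretend this just took a gradient step
target = [np.zeros((2, 2))]   # frozen copy, updated only via EMA

target = ema_update(target, online, momentum=0.9)
# target is now 0.1 * ones: it deliberately lags the online encoder.
# Combined with a predictor bottleneck on the online branch only, this
# asymmetry prevents both encoders from collapsing to a constant output.
```

The slow-moving target gives the predictor a stable regression objective, which is why the EMA momentum is typically kept close to 1.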