Hasty Briefs (beta)


Exploring JEPA for real-time speech translation

3 days ago
  • #self-supervised-learning
  • #audio-encoder
  • #speech-translation
  • JEPA-v0 is a self-supervised audio encoder designed for real-time speech-to-speech translation, aiming to preserve voice, emotion, and timing.
  • Traditional translation models lose paralinguistic features (e.g., pitch, rhythm) by converting speech to text and back, while JEPA-v0 works directly with audio representations.
  • Self-supervised learning allows JEPA-v0 to train on diverse audio data (speech, music, environmental sounds) without labeled datasets, unlike supervised models like Whisper.
  • JEPA-v0 avoids representational collapse (where every input maps to the same embedding) through mechanisms such as a stop-gradient on the target encoder, an EMA-momentum target update, and a predictor bottleneck.
  • JEPA-v0 scores well on acoustic tasks such as spoofing detection (0.927) but struggles with linguistic tasks such as speech recognition (0.000) and general audio captioning (0.478).
  • Visualizations of embeddings show JEPA-v0 captures broad acoustic patterns (e.g., emotion in CREMA-D, music texture in GTZAN) but lacks phonemic resolution.
  • Future improvements include increasing temporal resolution, preserving frequency structure, and integrating with translation decoders to retain speaker characteristics.
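The collapse-avoidance recipe in the bullets above (stop-gradient target encoder, EMA momentum, predictor bottleneck) can be sketched as a toy training step. This is a minimal illustration, not the post's actual model: the linear encoders, dimensions, learning rate, and momentum value are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, illustrative dimensions (not from the post).
D_IN, D_EMB, D_BOT = 32, 16, 4
TAU = 0.99   # EMA momentum for the target encoder
LR = 0.05    # learning rate for the online branch

# Linear stand-ins for the encoders and the narrow predictor.
W_ctx = rng.normal(scale=0.1, size=(D_IN, D_EMB))   # context encoder (trained)
W_tgt = W_ctx.copy()                                # target encoder (EMA copy, never backprop'd)
W_p1 = rng.normal(scale=0.1, size=(D_EMB, D_BOT))   # predictor squeezed through a bottleneck
W_p2 = rng.normal(scale=0.1, size=(D_BOT, D_EMB))

def train_step(x_ctx, x_tgt):
    """One JEPA-style step: predict the target embedding from the context embedding."""
    global W_ctx, W_tgt, W_p1, W_p2
    z_ctx = x_ctx @ W_ctx        # online (context) embedding
    h = z_ctx @ W_p1             # bottleneck activation
    z_pred = h @ W_p2            # predicted target embedding
    z_tgt = x_tgt @ W_tgt        # stop-gradient: W_tgt receives no gradient updates
    err = z_pred - z_tgt
    loss = float(np.mean(err ** 2))
    # Manual MSE gradients, flowing only through the online branch.
    g = 2.0 * err / err.size
    g_p2 = h.T @ g
    dh = g @ W_p2.T
    g_p1 = z_ctx.T @ dh
    g_ctx = x_ctx.T @ (dh @ W_p1.T)
    W_p2 -= LR * g_p2
    W_p1 -= LR * g_p1
    W_ctx -= LR * g_ctx
    # EMA momentum: the target encoder slowly tracks the online encoder.
    W_tgt = TAU * W_tgt + (1.0 - TAU) * W_ctx
    return loss

x = rng.normal(size=(8, D_IN))   # a batch standing in for audio frames
losses = [train_step(x, x) for _ in range(200)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The bottleneck (D_BOT much smaller than D_EMB) keeps the predictor from trivially copying its input, while the frozen-then-slowly-updated target gives the online encoder a stable regression objective: together these are what keep the representations from collapsing to a constant.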