VibeVoice-ASR: speech-to-text model designed to handle 60-minute long-form audio
5 days ago
- #multilingual
- #ASR
- #speech-to-text
- VibeVoice-ASR is a unified speech-to-text model for 60-minute long-form audio.
- Generates structured transcriptions that record Who (the speaker), When (timestamps), and What (the spoken content).
- Supports customized hotwords and over 50 languages.
- Processes 60 minutes of audio in a single pass, without slicing it into chunks.
- Includes speaker tracking, semantic coherence, and multilingual support.
- Jointly performs ASR, diarization, and timestamping.
- Open source under the MIT License, developed by Microsoft Research.
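
To make the Who/When/What structure concrete, here is a minimal Python sketch of how such a transcript could be represented and summarized downstream. It is an illustration only: the `Segment` class, its field names, and the sample data are hypothetical and do not reflect the model's actual output schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript entry (hypothetical schema, not the model's real format)."""
    speaker: str   # Who: speaker label from diarization
    start: float   # When: start time in seconds
    end: float     # When: end time in seconds
    text: str      # What: transcribed content


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, letting minutes run past 59 for long audio."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Produce one readable line per segment, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Total seconds spoken per speaker."""
    totals: dict[str, float] = defaultdict(float)
    for s in segments:
        totals[s.speaker] += s.end - s.start
    return dict(totals)


# Made-up sample data spanning a 60-minute recording.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the meeting."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, let's begin."),
    Segment("Speaker 1", 3590.0, 3600.0, "That wraps up the hour."),
]

print(render(segments))
print(speaking_time(segments))
```

Keeping speaker, time span, and text in one record is what lets a single transcript answer questions such as who spoke when and for how long, without aligning separate ASR and diarization outputs after the fact.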