Qwen3.5-Omni Technical Report
- #Large Language Models
- #Multimodal AI
- #Speech Synthesis
- Qwen3.5-Omni is an advanced multimodal model that scales to hundreds of billions of parameters and supports a 256k-token context length.
- It achieves state-of-the-art (SOTA) results on 215 audio and audio-visual tasks, surpassing or matching competitors such as Gemini-3.1 Pro.
- The model uses a Hybrid Attention MoE framework for efficient long-sequence inference and supports extensive audio and video processing.
- ARIA is introduced to enhance streaming speech synthesis stability by dynamically aligning text and speech units.
- It supports multilingual understanding and speech generation across 10 languages with emotional nuance and advanced audio-visual grounding capabilities.
- The model exhibits a novel Audio-Visual Vibe Coding capability, enabling coding directly from audio-visual instructions.
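The ARIA mechanism mentioned above is described only at a high level: it stabilizes streaming speech synthesis by keeping generated speech units aligned with the text that has actually arrived. A minimal sketch of that idea, assuming a simple token-budget heuristic, could look like the following. All class and method names, and the fixed units-per-token ratio, are illustrative assumptions and not the report's actual algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class StreamingAligner:
    """Hypothetical sketch: throttle speech-unit emission so the speech
    stream never runs ahead of the text received so far."""
    units_per_token: float = 4.0   # assumed average speech units per text token
    text_seen: int = 0             # text tokens received from the LLM stream
    units_emitted: int = 0         # speech units already released downstream
    pending: list = field(default_factory=list)

    def feed_text(self, n_tokens: int) -> None:
        # Register newly arrived text tokens from the streaming decoder.
        self.text_seen += n_tokens

    def try_emit(self, candidate_units: list) -> list:
        # Emit only as many speech units as the seen text can "cover";
        # hold the remainder until more text arrives.
        budget = max(int(self.text_seen * self.units_per_token)
                     - self.units_emitted, 0)
        emit, self.pending = candidate_units[:budget], candidate_units[budget:]
        self.units_emitted += len(emit)
        return emit

aligner = StreamingAligner()
aligner.feed_text(2)                       # two text tokens have arrived
out = aligner.try_emit(list(range(10)))    # ten candidate speech units
# With a ratio of 4.0, only 8 units are released; 2 stay pending.
```

In a real system the per-token budget would come from a learned alignment model rather than a fixed ratio; the sketch only shows the gating structure that keeps streaming synthesis from drifting ahead of the text.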