Qwen3.5-Omni Technical Report

Qwen3.5-Omni is an advanced multimodal model scaling to hundreds of billions of parameters with a 256k context length.
It achieves state-of-the-art (SOTA) results on 215 audio and audio-visual tasks, matching or surpassing competitors such as Gemini-3.1 Pro.
The model uses a Hybrid Attention MoE framework for efficient long-sequence inference and supports extensive audio and video processing.
The report introduces ARIA, which improves the stability of streaming speech synthesis by dynamically aligning text and speech units.
Qwen3.5-Omni supports multilingual understanding and speech generation across 10 languages with emotional nuance, along with advanced audio-visual grounding capabilities.
The model exhibits a novel Audio-Visual Vibe Coding capability, enabling coding directly from audio-visual instructions.
