Qwen3-Omni: Native Omni AI Model for Text, Image & Video
- #real-time processing
- #multimodal AI
- #multilingual support
- Qwen3-Omni is a multilingual omni-modal foundation model that processes text, images, audio, and video, delivering real-time streaming responses in text and natural speech.
- Key features include state-of-the-art performance across modalities, support for 119 text languages, 19 speech input languages, and 10 speech output languages.
- The model introduces a novel MoE-based Thinker–Talker architecture with AuT pretraining and a multi-codebook scheme for low-latency speech generation.
- Qwen3-Omni supports real-time audio/video interaction with low-latency streaming, and its behavior can be steered flexibly via system prompts (see the conversation sketch after this list).
- The release also includes Qwen3-Omni-30B-A3B-Captioner, an open-source audio captioner designed to produce detailed, low-hallucination captions.
- Qwen3-Omni can be deployed with Hugging Face Transformers, vLLM, or the DashScope API; vLLM is recommended for large-scale or low-latency serving (a Transformers sketch follows this list).
- The model supports batch inference, real-time audio output, and selectable voices (Ethan, Chelsie, Aiden); see the speech-output sketch after this list.
- Performance benchmarks show Qwen3-Omni achieving SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36.
- The model is available in three versions: Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner.
- Installation and usage guides are provided for local deployment, web demos, and Docker images.
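The system prompt is the main lever for controlling persona, response style, and output language. Below is a minimal sketch of the conversation structure; the system text and media URL are illustrative placeholders, and the message schema is assumed to follow the chat format used by earlier Qwen omni releases.

```python
# Sketch of a conversation steered by a system prompt (text and URL are
# illustrative placeholders). The same structure feeds apply_chat_template
# in the deployment sketch that follows.
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": (
                    "You are a concise assistant. Answer in English, keep spoken "
                    "replies under two sentences, and describe visual content first."
                ),
            },
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://example.com/clip.mp4"},  # placeholder URL
            {"type": "text", "text": "What is happening in this clip?"},
        ],
    },
]
```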
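For local deployment with Hugging Face Transformers, the flow is: apply the chat template, pack the multimodal inputs, and generate. This is a minimal sketch; the class names Qwen3OmniMoeForConditionalGeneration and Qwen3OmniMoeProcessor and the qwen-omni-utils helper process_mm_info follow the naming pattern of earlier Qwen omni releases and are assumptions here, so check the official README for the exact identifiers.

```python
# Continues the sketch above: load the model with Transformers and run the
# conversation. Class/helper names are assumptions based on the Qwen omni
# naming pattern; the official README gives the exact identifiers.
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # packs image/audio/video inputs

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

# `messages` is the conversation defined in the previous sketch.
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# return_audio=False keeps this step text-only; speech output is shown next.
# The kwarg mirrors the earlier Qwen omni API and is an assumption here.
text_ids = model.generate(**inputs, use_audio_in_video=True,
                          return_audio=False, max_new_tokens=512)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```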
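Speech output and voice selection plug into the same generate call. The speaker and return_audio keyword arguments and the 24 kHz sample rate mirror the earlier Qwen omni API and are assumptions here; verify them against the Qwen3-Omni model card.

```python
# Continues the sketch above: request spoken output and choose a voice.
# The speaker/return_audio kwargs and the 24 kHz sample rate are assumptions
# carried over from the earlier Qwen omni API.
import soundfile as sf

text_ids, audio = model.generate(
    **inputs,
    use_audio_in_video=True,
    speaker="Ethan",        # other documented voices: "Chelsie", "Aiden"
    return_audio=True,
    max_new_tokens=512,
)

print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])

# Persist the generated speech; a real-time client would consume streamed chunks instead.
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```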