Qwen3-VL can scan two-hour videos and pinpoint nearly every detail
- #Open Source
- #Multimodal AI
- #Alibaba
- Alibaba releases technical report on Qwen3-VL, an open multimodal model excelling in image-based math tasks and video analysis.
- The model handles very large inputs, such as two-hour videos or hundreds of document pages, within a 256K-token (262,144-token) context window (a rough token-budget sketch follows this list).
- In 'needle-in-a-haystack' tests, it locates single target frames with 100% accuracy in 30-minute videos and 99.5% accuracy in two-hour videos.
- Outperforms competitors like Gemini 2.5 Pro, GPT-5, and Claude Opus 4.1 in benchmarks, especially in visual math tasks.
- Scores 85.8% on MathVista and 74.6% on MathVision, ahead of the competing models.
- Excels on specialized benchmarks: 96.5% on DocVQA, 875 points on OCRBench (which covers 39 languages), and strong performance on GUI agent tasks.
- Handles complex PDF documents and scientific charts well, with scores of 56.2% on MMLongBench-Doc and 90.5% on CharXiv.
- Lags in general reasoning tasks, scoring 69.3% on MMMU-Pro compared to GPT-5's 78.4%.
- Key architectural upgrades include interleaved MRoPE position encoding, DeepStack injection of multi-level vision features, and a text-based timestamp system for video frames (see the schematic sketch after this list).
- Training ran in four phases over roughly one trillion tokens, expanding the context window from 8K (8,192) to 256K (262,144) tokens.
- Open weights are available under the Apache 2.0 license, with models ranging from 2B to 235B parameters (a minimal loading example follows this list).
- Qwen3-VL is positioned to drive further open-source development in multimodal AI.
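
To put the context-window claim in perspective, here is a rough token-budget calculation for a two-hour video. The sampling rate and tokens-per-frame figures are illustrative assumptions, not numbers from the report:

```python
# Rough token-budget sketch for fitting a two-hour video into a
# 256K-token (262,144) context window. SAMPLE_FPS and TOKENS_PER_FRAME
# are assumed values for illustration, not figures from the report.

CONTEXT_WINDOW = 262_144        # 256K tokens
VIDEO_SECONDS = 2 * 60 * 60     # two hours
SAMPLE_FPS = 1.0                # assumption: sample one frame per second
TOKENS_PER_FRAME = 32           # assumption: visual tokens per sampled frame

frames = int(VIDEO_SECONDS * SAMPLE_FPS)
video_tokens = frames * TOKENS_PER_FRAME
print(f"{frames} frames -> {video_tokens} visual tokens")
print(f"tokens left for text and timestamps: {CONTEXT_WINDOW - video_tokens}")
```

Under these assumptions the video consumes about 230K tokens, leaving roughly 32K for the prompt, timestamps, and the model's answer.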
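On the interleaved MRoPE upgrade: the report's exact formulation isn't reproduced here, but a common reading is that standard MRoPE assigns the temporal, height, and width axes to contiguous blocks of rotary frequencies, while the interleaved variant round-robins the axes so each one spans the full frequency range. A schematic sketch under that assumption:

```python
# Schematic sketch of interleaved MRoPE axis assignment, under the
# assumption that "interleaved" means round-robin allocation of the
# (t, h, w) axes across rotary frequency indices, versus contiguous
# blocks in the original MRoPE. Not the report's actual code.

def axis_assignment(num_freqs: int, interleaved: bool) -> list[str]:
    if interleaved:
        # Every axis sees low, mid, and high rotary frequencies.
        return ["thw"[i % 3] for i in range(num_freqs)]
    # Contiguous blocks: 't' gets only one frequency band, 'h' and 'w' the rest.
    third = num_freqs // 3
    return ["t"] * third + ["h"] * third + ["w"] * (num_freqs - 2 * third)

print("contiguous :", "".join(axis_assignment(12, False)))  # tttthhhhwwww
print("interleaved:", "".join(axis_assignment(12, True)))   # thwthwthwthw
```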
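Because the weights are openly licensed, a checkpoint can be loaded with Hugging Face transformers. A minimal sketch, assuming a transformers release with Qwen3-VL support; the model ID and message schema follow Qwen-VL conventions and should be checked against the model card:

```python
# Minimal loading sketch, assuming a recent transformers release with
# Qwen3-VL support; the checkpoint ID below is illustrative -- check
# the Qwen model cards on Hugging Face for the exact names.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # assumption: one of the released sizes
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},
        {"type": "text", "text": "What does this chart show?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:]))
```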