Hasty Briefs (beta)

Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

11 days ago
  • #Open Source
  • #Multimodal AI
  • #Alibaba
  • Alibaba releases technical report on Qwen3-VL, an open multimodal model excelling in image-based math tasks and video analysis.
  • The model processes large data loads, including two-hour videos or hundreds of document pages within a 256,000-token context window.
  • Achieves 100% accuracy in locating individual frames in 30-minute videos and 99.5% in two-hour videos in 'needle-in-a-haystack' tests.
  • Outperforms competitors such as Gemini 2.5 Pro, GPT-5, and Claude Opus 4.1 on many benchmarks, especially visual math tasks.
  • Scores 85.8% on MathVista and 74.6% on MathVision, the highest among the compared models.
  • Excels in specialized benchmarks: 96.5% on DocVQA, 875 points on OCRBench (39 languages), and strong performance in GUI agent tasks.
  • Handles complex PDF documents and scientific charts well, with scores of 56.2% on MMLongBench-Doc and 90.5% on CharXiv.
  • Lags in general reasoning tasks, scoring 69.3% on MMMU-Pro compared to GPT-5's 78.4%.
  • Key architectural upgrades include interleaved MRoPE (multimodal rotary position embedding), DeepStack technology, and a text-based timestamp system.
  • Trained on one trillion tokens across four phases, expanding context windows from 8,000 to 262,000 tokens.
  • Open weights available under Apache 2.0 license, with models ranging from 2B to 235B parameters.
  • Qwen3-VL is positioned to drive further open-source development in multimodal AI.
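The 'needle-in-a-haystack' scores above come from a standard long-context evaluation recipe: a unique target (a frame or fact) is planted at some depth in a long input, the model is asked to locate it, and accuracy is averaged over many trials. A minimal sketch of that harness logic, where `query_model` is a hypothetical stand-in for a real model call (here simulated by a perfect retriever scanning captions directly):

```python
import random

def make_haystack(num_frames: int, needle: str) -> tuple[list[str], int]:
    """Build a synthetic 'video' of filler frame captions with one
    unique needle caption inserted at a random depth."""
    frames = [f"frame {i}: generic scene" for i in range(num_frames)]
    pos = random.randrange(num_frames)
    frames[pos] = f"frame {pos}: {needle}"
    return frames, pos

def query_model(frames: list[str], needle: str) -> int:
    """Hypothetical stand-in for the real model call: return the index
    of the frame the model claims contains the needle."""
    return next(i for i, f in enumerate(frames) if needle in f)

def needle_accuracy(trials: int, num_frames: int) -> float:
    """Fraction of trials where the predicted frame index matches
    the true insertion position."""
    hits = 0
    for _ in range(trials):
        frames, true_pos = make_haystack(num_frames, "a red umbrella on a beach")
        hits += int(query_model(frames, "a red umbrella") == true_pos)
    return hits / trials
```

A real harness would sweep haystack length and needle depth systematically (e.g. 30-minute vs. two-hour videos) and report accuracy per cell of that grid; this sketch only samples random depths at a fixed length.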