Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
- #multimodal-ai
- #efficient-models
- #reasoning-systems
- Phi-4-reasoning-vision-15B is a 15B-parameter open-weight multimodal reasoning model that excels at math, science, and computer-use tasks.
- The model balances reasoning power, efficiency, and training data needs, delivering competitive performance at lower compute cost.
- Key design choices include a mid-fusion architecture, a dynamic-resolution vision encoder (a SigLIP-2 NaFlex variant), and rigorous data curation.
- Training used 200B tokens of multimodal data, significantly less than comparable models (e.g., Qwen 2.5 VL used 1T+ tokens).
- Data strategy focused on quality: filtered open-source datasets, high-quality internal data, and synthetic data for text-rich visual reasoning.
- The model employs a mixed reasoning/non-reasoning approach (a 20%/80% data split) to balance latency and accuracy across tasks.
- Evaluation shows strong performance across benchmarks (MathVista: 75.2, ScreenSpot_v2: 88.2) while maintaining efficiency.
- Applications include image captioning, GUI interaction, educational support, and scientific analysis with low-latency requirements.
- Released under a permissive license on Microsoft Foundry, Hugging Face, and GitHub, with model weights, fine-tuning code, and benchmark logs.
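The 20%/80% reasoning/non-reasoning data split mentioned above can be illustrated with a small batch sampler. This is a hypothetical sketch, not the team's actual training pipeline; the function and pool names are invented for illustration.

```python
import random

def mixed_batch(reasoning_pool, plain_pool, batch_size, reasoning_frac=0.2, rng=None):
    """Sample a training batch with a fixed fraction of reasoning examples.

    reasoning_frac=0.2 mirrors the 20%/80% reasoning/non-reasoning split
    described in the post; pool contents and names are illustrative only.
    """
    rng = rng or random.Random()
    n_reasoning = round(batch_size * reasoning_frac)
    # Draw the reasoning share first, fill the rest with non-reasoning data.
    batch = [rng.choice(reasoning_pool) for _ in range(n_reasoning)]
    batch += [rng.choice(plain_pool) for _ in range(batch_size - n_reasoning)]
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch
```

In practice the ratio would be enforced over the whole training run rather than per batch, but a fixed per-batch fraction keeps the sketch deterministic and easy to check.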