Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
- #multimodal-ai
- #efficient-models
- #reasoning-systems
- Phi-4-reasoning-vision-15B is a 15B-parameter open-weight multimodal reasoning model that excels at math, science, and computer-use tasks.
- The model balances reasoning power, efficiency, and training data needs, delivering competitive performance at lower compute cost.
- Key design choices include a mid-fusion architecture, a dynamic-resolution vision encoder (a SigLIP-2 NaFlex variant), and rigorous data curation.
- Training used 200B tokens of multimodal data, significantly less than comparable models (e.g., Qwen 2.5 VL used 1T+ tokens).
- Data strategy focused on quality: filtered open-source datasets, high-quality internal data, and synthetic data for text-rich visual reasoning.
- The model employs a mixed reasoning/non-reasoning approach (a 20%/80% data split) to balance latency and accuracy across tasks.
- Evaluation shows strong performance across benchmarks (MathVista: 75.2, ScreenSpot_v2: 88.2) while maintaining efficiency.
- Applications include image captioning, GUI interaction, educational support, and scientific analysis with low-latency requirements.
- Released under a permissive license on Microsoft Foundry, Hugging Face, and GitHub, with model weights, fine-tuning code, and benchmark logs.
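The 20%/80% reasoning/non-reasoning data split mentioned above can be illustrated with a small batch sampler. This is a hypothetical sketch, not the team's actual training pipeline; the function and pool names are invented for illustration.

```python
import random

def mixed_batch(reasoning_pool, plain_pool, batch_size, reasoning_frac=0.2, rng=None):
    """Sample a training batch with a fixed fraction of reasoning examples.

    reasoning_frac=0.2 mirrors the 20%/80% reasoning/non-reasoning split
    described in the post; pool contents and names are illustrative only.
    """
    rng = rng or random.Random()
    n_reasoning = round(batch_size * reasoning_frac)
    # Draw the reasoning share first, fill the rest with non-reasoning data.
    batch = [rng.choice(reasoning_pool) for _ in range(n_reasoning)]
    batch += [rng.choice(plain_pool) for _ in range(batch_size - n_reasoning)]
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch
```

In practice the ratio would be enforced over the whole training run rather than per batch, but a fixed per-batch fraction keeps the sketch deterministic and easy to check.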