Hasty Briefs (beta)

Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

5 hours ago
  • #multimodal-ai
  • #efficient-models
  • #reasoning-systems
  • Phi-4-reasoning-vision-15B is a 15B-parameter open-weight multimodal reasoning model that excels at math, science, and computer-use tasks.
  • The model balances reasoning capability, inference efficiency, and training-data requirements, delivering competitive performance at lower compute cost.
  • Key design choices include a mid-fusion architecture, a dynamic-resolution vision encoder (SigLIP-2 NaFlex variant), and rigorous data curation.
  • Training used 200B tokens of multimodal data, significantly less than comparable models (e.g., Qwen2.5-VL used over 1T tokens).
  • The data strategy prioritized quality: filtered open-source datasets, high-quality internal data, and synthetic data for text-rich visual reasoning.
  • The model employs a mixed reasoning/non-reasoning approach (20% reasoning / 80% non-reasoning data split) to trade off latency against accuracy across task types.
  • Evaluation shows strong performance across benchmarks (MathVista: 75.2, ScreenSpot_v2: 88.2) while maintaining efficiency.
  • Applications include image captioning, GUI interaction, educational support, and scientific analysis with low-latency requirements.
  • Released under a permissive license on Microsoft Foundry, Hugging Face, and GitHub with model weights, fine-tuning code, and benchmark logs.
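
The dynamic-resolution (NaFlex-style) encoder mentioned above can be illustrated with a small sketch: pick a resize target that keeps the image's aspect ratio, snaps each side to a multiple of the patch size, and caps the total patch count under a token budget. Function and parameter names here are hypothetical; this shows the general idea, not the actual SigLIP-2 NaFlex implementation.

```python
import math

def naflex_grid(width: int, height: int, patch: int = 16, max_patches: int = 1024):
    """Illustrative dynamic-resolution patching: choose a patch grid that
    roughly preserves aspect ratio and stays under a patch-count budget.
    (Hypothetical sketch, not the real SigLIP-2 NaFlex code.)"""
    # Native patch grid at full resolution.
    cols = max(1, round(width / patch))
    rows = max(1, round(height / patch))
    # Scale the grid down only if it exceeds the budget.
    if cols * rows > max_patches:
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    # Return the resize target (multiples of the patch size) and patch count.
    return cols * patch, rows * patch, cols * rows

# A 1920x1080 screenshot would natively need 120x68 = 8160 patches;
# the budget forces a downscale.
w, h, n = naflex_grid(1920, 1080)
```

A small image below the budget (e.g., 320x240 at 16px patches, 300 patches) passes through unresized, which is the point of dynamic resolution: dense GUI screenshots get as much detail as the token budget allows, while small images are not upsampled wastefully.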