Xiaomi MiMo-V2.5-Pro Open-Sourced: 1T Parameter Model
8 hours ago
- #mixture-of-experts
- #long-context
- #large-language-model
- MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters.
- It features a hybrid attention architecture (interleaving Sliding Window Attention and Global Attention) and 3-layer Multi-Token Prediction (MTP), supporting a context length of up to 1M tokens.
- The model is designed for demanding agentic workloads, complex software engineering, and long-horizon tasks that require maintaining coherence over long contexts.
- It was efficiently pre-trained on 27T tokens using FP8 mixed precision and post-trained with SFT, agentic RL, and Multi-Teacher On-Policy Distillation (MOPD).
- Evaluation results show strong performance on benchmarks like MMLU (89.4), GSM8K (99.6), and long-context tasks (e.g., GraphWalks), outperforming previous versions.
- Deployment recommendations include using SGLang or vLLM for optimal performance, with specific configuration examples provided.
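To illustrate the hybrid attention idea mentioned above, here is a minimal NumPy sketch of how per-layer attention masks could interleave Sliding Window Attention with full (global) causal attention. The window size and the 3:1 interleaving ratio are illustrative assumptions, not details from the announcement.

```python
import numpy as np

def causal_mask(n):
    # Global causal attention: each token attends to itself and all earlier tokens.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Sliding Window Attention: each token attends only to the last `window` tokens
    # (itself included), keeping per-token cost constant in sequence length.
    idx = np.arange(n)
    return causal_mask(n) & (idx[None, :] > idx[:, None] - window)

def layer_masks(n_layers, seq_len, window, swa_per_global=3):
    # Hypothetical interleaving: every (swa_per_global + 1)-th layer is global,
    # the rest use sliding-window attention.
    masks = []
    for layer in range(n_layers):
        if (layer + 1) % (swa_per_global + 1) == 0:
            masks.append(causal_mask(seq_len))            # global layer
        else:
            masks.append(sliding_window_mask(seq_len, window))  # SWA layer
    return masks
```

The practical point of such a layout is that most layers only pay for a fixed window of context, while the periodic global layers let information propagate across the full 1M-token sequence.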
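For the deployment recommendation, a serving setup might look like the following. The model ID, parallelism degree, and context limit below are placeholders, not values from the announcement; consult the model card for the recommended configuration.

```shell
# vLLM (OpenAI-compatible server); model ID and sizes are assumptions.
vllm serve XiaomiMiMo/MiMo-V2.5-Pro \
    --tensor-parallel-size 8 \
    --max-model-len 1048576

# SGLang equivalent:
python -m sglang.launch_server \
    --model-path XiaomiMiMo/MiMo-V2.5-Pro \
    --tp 8 \
    --context-length 1048576
```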