Xiaomi MiMo-V2.5-Pro Open-Sourced: 1T Parameter Model
8 hours ago
- #mixture-of-experts
- #long-context
- #large-language-model
- MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters.
- It features a hybrid attention architecture (interleaving Sliding Window Attention and Global Attention) and 3-layer Multi-Token Prediction (MTP), supporting a context length of up to 1M tokens.
- The model is designed for demanding agentic workloads, complex software engineering, and long-horizon tasks that require maintaining coherence over long contexts.
- It was efficiently pre-trained on 27T tokens using FP8 mixed precision and post-trained with SFT, agentic RL, and Multi-Teacher On-Policy Distillation (MOPD).
- Evaluation results show strong performance on benchmarks like MMLU (89.4), GSM8K (99.6), and long-context tasks (e.g., GraphWalks), outperforming previous versions.
- Deployment recommendations include using SGLang or vLLM for optimal performance, with specific configuration examples provided.
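To illustrate the hybrid attention idea mentioned above, here is a minimal NumPy sketch of how per-layer attention masks could interleave Sliding Window Attention with full (global) causal attention. The window size and the 3:1 interleaving ratio are illustrative assumptions, not details from the announcement.

```python
import numpy as np

def causal_mask(n):
    # Global causal attention: each token attends to itself and all earlier tokens.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Sliding Window Attention: each token attends only to the last `window` tokens
    # (itself included), keeping per-token cost constant in sequence length.
    idx = np.arange(n)
    return causal_mask(n) & (idx[None, :] > idx[:, None] - window)

def layer_masks(n_layers, seq_len, window, swa_per_global=3):
    # Hypothetical interleaving: every (swa_per_global + 1)-th layer is global,
    # the rest use sliding-window attention.
    masks = []
    for layer in range(n_layers):
        if (layer + 1) % (swa_per_global + 1) == 0:
            masks.append(causal_mask(seq_len))            # global layer
        else:
            masks.append(sliding_window_mask(seq_len, window))  # SWA layer
    return masks
```

The practical point of such a layout is that most layers only pay for a fixed window of context, while the periodic global layers let information propagate across the full 1M-token sequence.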
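For the deployment recommendation, a serving setup might look like the following. The model ID, parallelism degree, and context limit below are placeholders, not values from the announcement; consult the model card for the recommended configuration.

```shell
# vLLM (OpenAI-compatible server); model ID and sizes are assumptions.
vllm serve XiaomiMiMo/MiMo-V2.5-Pro \
    --tensor-parallel-size 8 \
    --max-model-len 1048576

# SGLang equivalent:
python -m sglang.launch_server \
    --model-path XiaomiMiMo/MiMo-V2.5-Pro \
    --tp 8 \
    --context-length 1048576
```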