Xiaomi MiMo-V2.5-Pro Open-Sourced: 1T-Parameter Model

8 hours ago
  • #mixture-of-experts
  • #long-context
  • #large-language-model
  • MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters.
  • It features a hybrid attention architecture that interleaves Sliding Window Attention with Global Attention, plus 3-layer Multi-Token Prediction (MTP), and supports a context length of up to 1M tokens (a layer-interleaving sketch appears after this list).
  • The model is designed for demanding agentic workloads, complex software engineering, and other long-horizon tasks, maintaining coherence over long contexts.
  • It was pre-trained efficiently on 27T tokens using FP8 mixed precision and post-trained with SFT, agentic RL, and Multi-Teacher On-Policy Distillation (MOPD); a sketch of the distillation loss follows this list.
  • Evaluation results show strong performance on benchmarks like MMLU (89.4), GSM8K (99.6), and long-context tasks (e.g., GraphWalks), outperforming previous versions.
  • Deployment recommendations include using SGLang or vLLM for optimal performance, with specific configuration examples provided; a hedged vLLM snippet appears below.
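
A minimal sketch of what "interleaving Sliding Window Attention and Global Attention" can look like in practice. The 3:1 interleave ratio, the 4096-token window, and the function names are illustrative assumptions, not values from the MiMo-V2.5-Pro release.

```python
import torch

def attention_layout(num_layers: int, global_every: int = 4) -> list[str]:
    """Attention type per transformer layer: every Nth layer is global,
    the rest use sliding-window attention (ratio assumed, not published)."""
    return ["global" if (i + 1) % global_every == 0 else "sliding_window"
            for i in range(num_layers)]

def sliding_window_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    """Boolean causal mask where each token attends to itself and at most
    the previous `window - 1` tokens (True = may attend)."""
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]           # never attend to the future
    near = (pos[:, None] - pos[None, :]) < window   # stay inside the window
    return causal & near

print(attention_layout(8))
# ['sliding_window', 'sliding_window', 'sliding_window', 'global',
#  'sliding_window', 'sliding_window', 'sliding_window', 'global']
```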
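
The summary names Multi-Teacher On-Policy Distillation (MOPD) without detail. Below is a hedged sketch of one plausible per-token objective: the student generates a rollout, each teacher scores the same token positions, and the student minimizes a weighted KL divergence toward the teacher distributions. The KL direction, the weighting scheme, and all names here are assumptions.

```python
import torch
import torch.nn.functional as F

def mopd_loss(student_logits: torch.Tensor,
              teacher_logits_list: list[torch.Tensor],
              teacher_weights: list[float]) -> torch.Tensor:
    """student_logits: [T, V] logits on a student-generated rollout;
    teacher_logits_list: per-teacher [T, V] logits on the same tokens."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    loss = student_logits.new_zeros(())
    for w, t_logits in zip(teacher_weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits, dim=-1)
        # KL(teacher || student), averaged over the T rollout positions
        loss = loss + w * F.kl_div(log_p_student, p_teacher,
                                   reduction="batchmean")
    return loss

# Toy usage: 16 rollout positions, 32k vocabulary, three teachers.
T, V = 16, 32_000
student = torch.randn(T, V, requires_grad=True)
teachers = [torch.randn(T, V) for _ in range(3)]
mopd_loss(student, teachers, [0.5, 0.3, 0.2]).backward()
```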
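
For deployment, a minimal sketch using vLLM's offline Python API. The Hugging Face repo id "XiaomiMiMo/MiMo-V2.5-Pro" and the parallelism/length settings are assumptions; the configuration in the official model card should take precedence.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="XiaomiMiMo/MiMo-V2.5-Pro",  # assumed repo id; check the release
    tensor_parallel_size=8,            # a 1T-parameter MoE needs many GPUs
    max_model_len=131_072,             # raise toward 1M tokens if memory allows
    trust_remote_code=True,            # custom architectures often require this
)

outputs = llm.generate(
    ["Summarize the tradeoffs of hybrid attention for long-context models."],
    SamplingParams(temperature=0.7, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```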