DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
- #AI Performance Benchmarking
- #Language Models
- #Natural Language Processing
- Introduces two Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T total parameters and 49B activated, and DeepSeek-V4-Flash with 284B total parameters and 13B activated, both supporting 1M token contexts.
- Highlights key architectural upgrades: a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), reducing FLOPs and KV-cache size; Manifold-Constrained Hyper-Connections (mHC); and the Muon optimizer.
- Describes training on >32T tokens and a two-stage post-training pipeline involving domain-specific expert cultivation via SFT and RL with GRPO, followed by unified consolidation through on-policy distillation.
- Presents evaluation results showing DeepSeek-V4 models achieving top performance on benchmarks such as MMLU-Pro, C-Eval, and LiveCodeBench, with DeepSeek-V4-Pro-Max claimed to be the best open-source model.
- Details three reasoning effort modes: Non-think for fast responses, Think High for logical analysis, and Think Max for maximum reasoning capability, each with specific use cases and response formats.
- Provides model download links on HuggingFace and ModelScope, and includes instructions for local deployment, license info (MIT), and citation details.
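The parameter figures quoted above imply that only a small fraction of each MoE model's weights is active per token. A quick arithmetic check, using only the numbers from the summary:

```python
# Activated-parameter fraction for the two MoE models, computed from the
# totals quoted in the summary (1.6T total / 49B activated for Pro,
# 284B total / 13B activated for Flash).
def activation_ratio(activated_b: float, total_b: float) -> float:
    """Return the percentage of parameters activated per token."""
    return activated_b / total_b * 100

pro = activation_ratio(49, 1600)    # DeepSeek-V4-Pro: 49B of 1.6T
flash = activation_ratio(13, 284)   # DeepSeek-V4-Flash: 13B of 284B
print(f"Pro: {pro:.1f}% activated, Flash: {flash:.1f}% activated")
# → Pro: 3.1% activated, Flash: 4.6% activated
```

So both models route each token through roughly 3-5% of their total parameters, which is the usual efficiency argument for sparse MoE designs.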