DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
- #AI Performance Benchmarking
- #Language Models
- #Natural Language Processing
- Introduces two Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T total parameters and 49B activated, and DeepSeek-V4-Flash with 284B total parameters and 13B activated, both supporting 1M token contexts.
- Highlights key architectural upgrades: a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), reducing FLOPs and KV-cache size; Manifold-Constrained Hyper-Connections (mHC); and the Muon optimizer.
- Describes training on >32T tokens and a two-stage post-training pipeline involving domain-specific expert cultivation via SFT and RL with GRPO, followed by unified consolidation through on-policy distillation.
- Presents evaluation results showing DeepSeek-V4 models achieving top performance on benchmarks such as MMLU-Pro, C-Eval, and LiveCodeBench, with DeepSeek-V4-Pro-Max claimed to be the best open-source model.
- Details three reasoning effort modes: Non-think for fast responses, Think High for logical analysis, and Think Max for maximum reasoning capability, each with specific use cases and response formats.
- Provides model download links on HuggingFace and ModelScope, and includes instructions for local deployment, license info (MIT), and citation details.
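The parameter figures quoted above imply that only a small fraction of each MoE model's weights is active per token. A quick arithmetic check, using only the numbers from the summary:

```python
# Activated-parameter fraction for the two MoE models, computed from the
# totals quoted in the summary (1.6T total / 49B activated for Pro,
# 284B total / 13B activated for Flash).
def activation_ratio(activated_b: float, total_b: float) -> float:
    """Return the percentage of parameters activated per token."""
    return activated_b / total_b * 100

pro = activation_ratio(49, 1600)    # DeepSeek-V4-Pro: 49B of 1.6T
flash = activation_ratio(13, 284)   # DeepSeek-V4-Flash: 13B of 284B
print(f"Pro: {pro:.1f}% activated, Flash: {flash:.1f}% activated")
# → Pro: 3.1% activated, Flash: 4.6% activated
```

So both models route each token through roughly 3-5% of their total parameters, which is the usual efficiency argument for sparse MoE designs.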