DeepSeek V4 Flash
- #mixture-of-experts
- #large-language-models
- #long-context
- DeepSeek-V4-Pro (1.6T parameters, 49B activated) and DeepSeek-V4-Flash (284B parameters, 13B activated) are new MoE models with a 1-million-token context window.
- Key architectural improvements include Hybrid Attention (CSA+HCA) for efficiency, mHC for stability, and the Muon optimizer.
- Both models were pretrained on >32T tokens and post-trained via two-stage SFT, RL, and distillation.
- DeepSeek-V4-Pro-Max is positioned as the best open-source model, excelling in coding, reasoning, and agentic tasks.
- Models feature three reasoning modes: Non-think (fast), Think High (slower, analytical), and Think Max (full reasoning).
- Benchmarks show strong performance in knowledge, reasoning, coding, math, long-context, and agentic evaluations.
- The release includes base and instruct models, downloadable in mixed FP4/FP8 precision under the MIT license.
- Local deployment guidance is provided, and chat-prompt encoding is handled by custom Python scripts rather than a Jinja template.
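The script-based chat encoding is worth a sketch. Below is a minimal, hypothetical illustration of what such an encoder might look like; the special-token strings, the `encode_chat` name, and the `thinking` parameter are all invented placeholders for illustration, not DeepSeek's actual format.

```python
# Hypothetical sketch of script-based chat encoding, in the spirit of the
# release's custom Python scripts. All special-token strings below are
# invented placeholders, NOT DeepSeek's actual tokens.
def encode_chat(messages, thinking="non-think"):
    """Flatten a list of {'role', 'content'} dicts into a single prompt string."""
    parts = ["<|bos|>"]
    for msg in messages:
        parts.append(f"<|{msg['role']}|>{msg['content']}<|eot|>")
    parts.append("<|assistant|>")
    if thinking != "non-think":
        parts.append("<think>")  # hypothetical marker opening a reasoning span
    return "".join(parts)


prompt = encode_chat([{"role": "user", "content": "Hi"}], thinking="think-high")
print(prompt)  # <|bos|><|user|>Hi<|eot|><|assistant|><think>
```

A plain function like this can express conditional logic (e.g. the three reasoning modes) more directly than a Jinja template, which is one plausible motivation for the switch.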
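The parameter figures quoted above imply heavy MoE sparsity; a quick back-of-the-envelope check of the activated fraction, using only the numbers from the summary:

```python
# Activated-parameter fraction per token, from the figures in the summary.
models = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),   # (total params, activated params)
    "DeepSeek-V4-Flash": (284e9, 13e9),
}
for name, (total, activated) in models.items():
    print(f"{name}: {activated / total:.1%} of parameters active per token")
```

Both models activate only a few percent of their weights per token (roughly 3.1% for Pro and 4.6% for Flash), which is how a 1.6T-parameter model keeps inference cost closer to that of a ~49B dense model.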