GitHub - deepseek-ai/DeepSeek-V3
- #Large Language Model
- #DeepSeek-V3
- #Mixture-of-Experts
- DeepSeek-V3 is a 671B total-parameter Mixture-of-Experts language model with 37B parameters activated per token, built on Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture (a toy routing sketch follows this list).
- The model pioneers an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective, which improve performance and can accelerate inference (see the load-balancing sketch below).
- It was pre-trained on 14.8 trillion tokens with FP8 mixed-precision training, requiring only 2.788M H800 GPU hours, and the training run remained stable with no loss spikes (a tile-wise FP8 scaling sketch appears below).
- Post-training distills reasoning capability from DeepSeek-R1 while maintaining control over output style and length.
- Evaluations show DeepSeek-V3 outperforming other open-source models and rivaling leading closed-source models, with particular strength in math and code tasks; it supports a context length of up to 128K tokens.
- The model can be deployed locally via frameworks such as SGLang, LMDeploy, TensorRT-LLM, vLLM, and LightLLM, in FP8 or BF16 precision, on NVIDIA, AMD, and Huawei Ascend hardware (a minimal vLLM example appears after the list).
- The code is MIT-licensed and the model weights are released under a model license that permits commercial use; the weights are on Hugging Face and the technical report is on arXiv.
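
As referenced above, here is a minimal sketch of how top-k expert routing keeps the activated parameter count far below the total. The expert count, hidden size, and softmax-over-selected gating are toy assumptions for illustration, not DeepSeek-V3's actual MLA/DeepSeekMoE configuration.

```python
import numpy as np

def moe_forward(x, experts, router_w, k=8):
    """Toy top-k Mixture-of-Experts layer: only k experts run for this
    token, so the activated parameter count is a small fraction of the
    total parameter count (the 37B-of-671B idea)."""
    scores = x @ router_w                      # affinity of the token to each expert
    topk = np.argsort(scores)[-k:]             # indices of the k highest-scoring experts
    gates = np.exp(scores[topk])               # softmax over the selected experts only
    gates /= gates.sum()
    out = np.zeros_like(x)
    for g, idx in zip(gates, topk):
        out += g * experts[idx](x)             # only these k experts do any compute
    return out

rng = np.random.default_rng(0)
dim, n_experts = 16, 64                        # toy sizes, not the real config
router_w = rng.normal(size=(dim, n_experts))
experts = [lambda h, W=rng.normal(size=(dim, dim)) / dim**0.5: h @ W
           for _ in range(n_experts)]
token = rng.normal(size=dim)
print(moe_forward(token, experts, router_w, k=8).shape)   # -> (16,)
```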
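A minimal sketch of the auxiliary-loss-free balancing idea mentioned above: a per-expert bias steers top-k selection, and after each batch the bias is nudged down for overloaded experts and up for underloaded ones, with no balancing term added to the training loss. The update speed `gamma`, the batch shape, and the synthetic skewed scores are assumptions for illustration.

```python
import numpy as np

def route_with_bias(scores, bias, k):
    """Top-k selection uses score + bias; the gate weight applied to each
    expert output would still come from the unbiased score, so the bias
    only steers which experts are chosen."""
    return np.argsort(scores + bias, axis=-1)[:, -k:]

def update_bias(bias, chosen, n_experts, gamma=1e-2):
    """After a batch, push the bias down for overloaded experts and up for
    underloaded ones; gamma is an assumed update speed."""
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    target = chosen.size / n_experts           # perfectly even load
    return bias - gamma * np.sign(load - target)

rng = np.random.default_rng(0)
n_tokens, n_experts, k = 1024, 64, 8
bias = np.zeros(n_experts)
skew = np.linspace(0.0, 1.0, n_experts)        # synthetic preference for late experts
for _ in range(200):
    scores = rng.normal(size=(n_tokens, n_experts)) + skew
    chosen = route_with_bias(scores, bias, k)
    bias = update_bias(bias, chosen, n_experts)
load = np.bincount(chosen.ravel(), minlength=n_experts)
print("per-expert load after balancing: min", load.min(), "max", load.max())
```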
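An illustrative numpy sketch of the fine-grained, per-tile scaling idea behind FP8 mixed-precision training: each small tile of activations gets its own scale, so a single outlier only degrades its own tile. The rounding below merely simulates a 4-bit significand; real FP8 training casts to hardware E4M3/E5M2 types inside fused kernels, and the tile size of 128 is assumed here.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_roundtrip(x, tile=128):
    """Quantize a 1-D activation vector tile-by-tile to an FP8-like format
    and dequantize it again, returning the reconstruction and the scales."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    scales = []
    for start in range(0, x.size, tile):
        chunk = x[start:start + tile]
        scale = np.abs(chunk).max() / E4M3_MAX + 1e-12   # per-tile scale
        scaled = np.clip(chunk / scale, -E4M3_MAX, E4M3_MAX)
        m, e = np.frexp(scaled)                # scaled = m * 2**e with m in [0.5, 1)
        m = np.round(m * 16.0) / 16.0          # keep a ~4-bit significand (simulation)
        out[start:start + tile] = np.ldexp(m, e) * scale
        scales.append(scale)
    return out, np.array(scales)

x = np.random.default_rng(0).normal(scale=5.0, size=512).astype(np.float32)
x[7] = 300.0                                   # outlier confined to the first tile
xq, s = fp8_roundtrip(x)
print("max reconstruction error:", float(np.abs(x - xq).max()))
```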
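A hedged example of local deployment through vLLM's offline Python API, one of the frameworks listed above. The parallelism and length settings are assumptions; the full 671B model typically needs a multi-GPU node, or several nodes, depending on GPU memory.

```python
from vllm import LLM, SamplingParams

# Assumed invocation: the model id and flags may need adjusting to your
# checkpoint and hardware.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    trust_remote_code=True,
    tensor_parallel_size=8,      # assumption: one 8-GPU node
    max_model_len=8192,          # keep the KV cache small for a smoke test
)
params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Write a Python function that checks whether a number is prime."], params)
print(outputs[0].outputs[0].text)
```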