Hasty Briefs (beta)


GitHub - deepseek-ai/DeepSeek-V3

3 hours ago
  • #Large Language Model
  • #DeepSeek-V3
  • #Mixture-of-Experts
  • DeepSeek-V3 is a 671B-total-parameter Mixture-of-Experts language model that activates 37B parameters per token, featuring efficient architectures such as Multi-head Latent Attention (MLA) and DeepSeekMoE.
  • The model pioneers an auxiliary-loss-free load balancing strategy and uses a multi-token prediction training objective for improved performance and inference acceleration.
  • It was pre-trained on 14.8 trillion tokens using FP8 mixed precision training, achieving high training efficiency with only 2.788M H800 GPU hours and stable training without loss spikes.
  • Post-training includes knowledge distillation from DeepSeek-R1 for enhanced reasoning capabilities and control over output style and length.
  • Evaluation shows DeepSeek-V3 outperforms open-source models and competes with leading closed-source models, excelling in math and code tasks, and supports up to 128K context length.
  • The model supports local deployment through frameworks such as SGLang, LMDeploy, TensorRT-LLM, vLLM, and LightLLM, with FP8 and BF16 precision on NVIDIA, AMD, and Huawei Ascend hardware.
  • The code is released under the MIT License, while the model weights are covered by a separate model license that permits commercial use; weights are hosted on Hugging Face, and the technical report is on arXiv.
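To make the Mixture-of-Experts routing concrete: the sketch below illustrates top-k expert selection where a per-expert bias influences which experts are chosen but not their combination weights, which is the general idea behind an auxiliary-loss-free balancing strategy. This is a minimal NumPy illustration under assumed names and dimensions, not DeepSeek-V3's actual implementation.

```python
import numpy as np

def moe_route(hidden, gate_w, expert_bias, top_k=2):
    """Top-k expert routing with a balance bias used only for selection.

    hidden:      (tokens, d_model) token representations
    gate_w:      (d_model, n_experts) gating weights
    expert_bias: (n_experts,) bias, nudged up for under-loaded experts
                 and down for over-loaded ones during training
    """
    # Sigmoid token-to-expert affinity scores: (tokens, n_experts)
    scores = 1.0 / (1.0 + np.exp(-(hidden @ gate_w)))
    # The bias affects which experts are selected ...
    topk_idx = np.argsort(-(scores + expert_bias), axis=-1)[:, :top_k]
    # ... but the gating weights come from the unbiased scores,
    # renormalized over the selected experts.
    topk_scores = np.take_along_axis(scores, topk_idx, axis=-1)
    gate = topk_scores / topk_scores.sum(axis=-1, keepdims=True)
    return topk_idx, gate

rng = np.random.default_rng(0)
idx, gate = moe_route(rng.normal(size=(4, 16)),
                      rng.normal(size=(16, 8)),
                      np.zeros(8))
print(idx.shape, gate.shape)  # (4, 2) (4, 2)
```

Each token's output would then be the gate-weighted sum of its selected experts' outputs; because balancing is handled by adjusting `expert_bias` rather than by an auxiliary loss term, the training objective is left untouched.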
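As a rough sketch of what local deployment through one of the listed frameworks looks like, the commands below use vLLM's OpenAI-compatible server. The exact flags, vLLM version behavior, and hardware layout are assumptions for illustration, not taken from the repository; a single node cannot hold the full 671B model, so real deployments need multi-GPU or multi-node parallelism.

```shell
# Install vLLM (FP8/BF16 support depends on version and hardware).
pip install vllm

# Launch an OpenAI-compatible server; parallelism flags are illustrative.
vllm serve deepseek-ai/DeepSeek-V3 \
    --trust-remote-code \
    --tensor-parallel-size 8

# Query it with the standard chat-completions API.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-V3",
         "messages": [{"role": "user", "content": "Hello"}]}'
```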