What Went into Training DeepSeek-R1? – Epoch AI
- #AI
- #Machine Learning
- #DeepSeek-R1
- DeepSeek-R1 is an open-weights reasoning model released on January 20th, 2025, comparable to OpenAI’s o1 in benchmark performance.
- The architecture of DeepSeek-R1 is identical to DeepSeek v3: a sparse mixture-of-experts (MoE) model with 671B total parameters, of which 37B are active per token.
- DeepSeek-R1 uses multi-head latent attention (MLA) to optimize KV cache size, making it arithmetic-bound rather than memory-bound during long-context inference.
- The model was pre-trained on 14.8 trillion tokens using a cluster of 2048 H800 GPUs, at an estimated cost of roughly $5.3M.
- Reinforcement learning (RL) was used to enhance reasoning performance, with an estimated cost of $1M, bringing the total training cost to around $6M.
- DeepSeek-R1 matches o1 in benchmark performance but is priced significantly lower, at $2.2 per million output tokens versus o1’s $60.
- The model’s efficiency and competitive pricing may pressure US labs to reduce their profit margins.
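To make the MLA point concrete, here is a back-of-envelope comparison of per-token KV-cache size under standard multi-head attention versus MLA's compressed latent. The dimensions (61 layers, 128 heads of size 128, a 512-dim KV latent plus 64 decoupled RoPE dims, bf16 storage) are assumptions taken to be DeepSeek-V3-like; they are not stated in this article.

```python
# Rough KV-cache comparison: standard multi-head attention (MHA) vs
# multi-head latent attention (MLA). Dimensions below are assumed
# DeepSeek-V3-style values, not figures from this article.
layers, heads, head_dim = 61, 128, 128
bytes_per_elem = 2  # bf16

# MHA caches full K and V vectors for every head at every layer.
mha_kv_per_token = layers * heads * head_dim * 2 * bytes_per_elem

# MLA caches only a shared compressed latent (512 dims) plus a small
# decoupled RoPE component (64 dims) per layer.
latent_dim, rope_dim = 512, 64
mla_kv_per_token = layers * (latent_dim + rope_dim) * bytes_per_elem

print(f"MHA: {mha_kv_per_token / 1024:.0f} KiB/token")   # ~3904 KiB
print(f"MLA: {mla_kv_per_token / 1024:.1f} KiB/token")   # ~68.6 KiB
print(f"reduction: ~{mha_kv_per_token / mla_kv_per_token:.0f}x")
```

Under these assumed dimensions the cache shrinks by roughly 50x, which is why long-context decoding stops being dominated by memory traffic for the KV cache.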
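The ~$5.3M pre-training figure can be sanity-checked with simple arithmetic. The GPU-hour count (~2.66M H800 GPU-hours) and the $2/GPU-hour rental rate are assumptions drawn from DeepSeek's own reporting, not from this article; the token count and active-parameter count are from the bullets above.

```python
# Back-of-envelope check of the ~$5.3M pre-training cost.
# Assumed inputs: ~2.664M H800 GPU-hours at ~$2/GPU-hour
# (DeepSeek-reported figures, not from this article).
gpu_hours = 2.664e6
price_per_gpu_hour = 2.0
pretraining_cost = gpu_hours * price_per_gpu_hour  # ≈ $5.33M

# Training compute via the standard ~6 * N * D rule, counting only
# the 37B parameters active per token over 14.8T training tokens.
active_params = 37e9
tokens = 14.8e12
train_flops = 6 * active_params * tokens  # ≈ 3.3e24 FLOP

print(f"cost ≈ ${pretraining_cost / 1e6:.2f}M, compute ≈ {train_flops:.2e} FLOP")
```

Adding the estimated ~$1M for the RL phase on top of this reproduces the ~$6M total cited above.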