Deploying DeepSeek on 96 H100 GPUs
- #Expert Parallelism
- #GPU Inference
- #LLM Optimization
- DeepSeek is an open-source large language model (LLM) with strong performance, whose architecture combines Multi-head Latent Attention (MLA) with a Mixture of Experts (MoE).
- SGLang serves DeepSeek on 12 nodes of 8 H100 GPUs each (96 GPUs total), reaching 52.3k input tokens/sec and 22.3k output tokens/sec per node.
- Prefill-decode (PD) disaggregation runs the compute-intensive prefill phase and the memory-intensive decode phase on separate GPU pools, avoiding interference between the two and reducing latency (conceptual sketch after this list).
- Large-scale Expert Parallelism (EP) spreads the MoE expert weights across GPUs to relieve memory pressure, using DeepEP for the all-to-all dispatch/combine communication and EPLB to rebalance uneven expert workloads (placement sketch below).
- Two-batch Overlap (TBO) splits each batch into two micro-batches so that communication overlaps with computation, raising throughput by up to 35% in the prefill and decode phases (timing model below).
- SGLang's implementation comes close to DeepSeek's officially reported inference performance, at a cost of about $0.20 per 1M output tokens, roughly 1/5 the price of the official DeepSeek Chat API (arithmetic check below).
- Supporting optimizations include DisposableTensor for releasing activation memory early, DeepGEMM kernels for efficient MoE computation, and workload-simulation tools for analyzing expert distribution (memory-release sketch below).
- Limitations include high time-to-first-token (TTFT, 2-5 s) and inter-token latency (ITL, ~100 ms), sequence-length constraints, and incomplete integration of multi-token prediction (MTP) with DP attention.
- Future work includes latency optimization, Blackwell architecture support, and flexible tensor parallelism for dense FFNs.
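
To make the PD split concrete, here is a minimal, hypothetical sketch of the two phases; it is a toy, not SGLang's implementation, and the `Request`, `prefill`, and `decode` names plus the placeholder "sampling" are invented for illustration. Prefill does one large, compute-bound pass over the whole prompt and produces the KV cache; decode then runs many small, memory-bound steps that each reread that cache.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: list[int]
    kv_cache: list[int] = field(default_factory=list)   # stand-in for per-layer K/V states
    output: list[int] = field(default_factory=list)


def prefill(req: Request) -> Request:
    """Runs on the prefill fleet: one compute-heavy pass over all prompt tokens."""
    req.kv_cache = list(req.prompt)          # placeholder for the real attention KV cache
    return req                               # the KV cache is then shipped to a decode node


def decode(req: Request, max_new_tokens: int) -> Request:
    """Runs on the decode fleet: one token per step, dominated by KV-cache reads."""
    for step in range(max_new_tokens):
        token = (sum(req.kv_cache) + step) % 50257   # placeholder "sampling"
        req.output.append(token)
        req.kv_cache.append(token)
    return req


req = decode(prefill(Request(prompt=[1, 2, 3])), max_new_tokens=4)
print(req.output)
```

Because the two phases never share a GPU, a long prefill can no longer stall decode steps of other requests, which is the interference PD disaggregation removes.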
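The placement sketch below illustrates the load-balancing idea behind EPLB with a simple greedy longest-processing-time heuristic: assign the heaviest experts first, always to the currently least-loaded GPU. The `place_experts` function and the example loads are hypothetical; the real EPLB also replicates hot experts and balances hierarchically across nodes.

```python
import heapq


def place_experts(expert_load: list[int], num_gpus: int) -> list[list[int]]:
    """Assign each expert id to a GPU, heaviest experts first (greedy LPT)."""
    heap = [(0, gpu) for gpu in range(num_gpus)]        # (current load, gpu id)
    heapq.heapify(heap)
    placement: list[list[int]] = [[] for _ in range(num_gpus)]
    for expert in sorted(range(len(expert_load)), key=lambda e: -expert_load[e]):
        load, gpu = heapq.heappop(heap)                 # least-loaded GPU so far
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + expert_load[expert], gpu))
    return placement


# 16 experts with a skewed token load, spread over 4 GPUs.
loads = [900, 850, 120, 100, 90, 80, 70, 60, 50, 40, 30, 30, 20, 20, 10, 10]
for gpu, experts in enumerate(place_experts(loads, num_gpus=4)):
    print(f"GPU {gpu}: experts {experts}, load {sum(loads[e] for e in experts)}")
```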
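The toy timing model below shows why TBO helps, under the assumption that each layer has a compute phase and an all-to-all communication phase that can run concurrently on different engines. With two half-size micro-batches, one micro-batch's communication hides behind the other's compute. The durations are made-up numbers purely for illustration, not measurements from the deployment.

```python
def run(layers: int, compute_ms: float, comm_ms: float, microbatches: int) -> float:
    """Greedy schedule: one compute engine and one comm engine, usable in parallel."""
    compute_free = comm_free = 0.0
    ready = [0.0] * microbatches          # time each micro-batch finished its last phase
    for _ in range(layers):
        for mb in range(microbatches):
            start = max(ready[mb], compute_free)       # compute phase of this layer
            compute_free = ready[mb] = start + compute_ms
            start = max(ready[mb], comm_free)          # dispatch/combine communication
            comm_free = ready[mb] = start + comm_ms
    return max(ready)


layers, compute_ms, comm_ms = 4, 10.0, 6.0
single = run(layers, compute_ms, comm_ms, microbatches=1)
tbo = run(layers, compute_ms / 2, comm_ms / 2, microbatches=2)
print(f"single batch: {single:.0f} ms, two-batch overlap: {tbo:.0f} ms")
```

In this toy run the overlapped schedule finishes in 43 ms versus 64 ms for the single batch, the same kind of gain TBO is reported to deliver in practice.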
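A quick arithmetic check of the quoted $0.20 per 1M output tokens, assuming roughly $2 per H100 per hour (the GPU price is an assumption, not a figure stated in this summary):

```python
output_tok_per_sec_per_node = 22_300        # decode throughput per 8xH100 node
gpu_hourly_usd, gpus_per_node = 2.0, 8      # assumed rental price per H100

tokens_per_hour = output_tok_per_sec_per_node * 3600      # ~80M output tokens per node-hour
node_hourly_usd = gpu_hourly_usd * gpus_per_node          # $16 per node-hour
cost_per_million = node_hourly_usd / tokens_per_hour * 1e6
print(f"~${cost_per_million:.2f} per 1M output tokens")   # ~$0.20
```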
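Finally, the memory-release sketch below captures the idea behind DisposableTensor: hold the only strong reference to a large activation and drop it explicitly once its last consumer is done, so the allocator can reuse the memory without waiting for outer scopes to go away. This is a minimal stand-in, not SGLang's actual class.

```python
import torch


class DisposableTensor:
    """Hypothetical sketch of early memory release (not SGLang's real implementation)."""

    def __init__(self, tensor: torch.Tensor):
        self._tensor = tensor

    def get(self) -> torch.Tensor:
        if self._tensor is None:
            raise RuntimeError("tensor was already disposed")
        return self._tensor

    def dispose(self) -> None:
        # Dropping the only reference lets the caching allocator reuse the storage now.
        self._tensor = None


hidden = DisposableTensor(torch.randn(1024, 512))
out = hidden.get() @ torch.randn(512, 1024)   # last use of this activation
hidden.dispose()                              # release it before later layers allocate
```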