Deploying DeepSeek on 96 H100 GPUs
- #Expert Parallelism
- #GPU Inference
- #LLM Optimization
- DeepSeek is an open-source large language model (LLM) with strong performance, whose architecture combines Multi-head Latent Attention (MLA) with a Mixture of Experts (MoE).
- SGLang serves DeepSeek on 12 nodes of 8 H100 GPUs each (96 GPUs total), reaching 52.3k input tokens/sec and 22.3k output tokens/sec per node.
- Prefill-decode (PD) disaggregation runs the compute-intensive prefill phase and the memory-intensive decode phase on separate GPU pools, avoiding interference between the two and reducing latency (conceptual sketch after this list).
- Large-scale Expert Parallelism (EP) spreads the MoE expert weights across GPUs to relieve memory pressure, using DeepEP for the all-to-all dispatch/combine communication and EPLB to rebalance uneven expert workloads (placement sketch below).
- Two-batch Overlap (TBO) splits each batch into two micro-batches so that communication overlaps with computation, raising throughput by up to 35% in the prefill and decode phases (timing model below).
- SGLang's implementation comes close to DeepSeek's officially reported inference performance, at a cost of about $0.20 per 1M output tokens, roughly 1/5 the price of the official DeepSeek Chat API (arithmetic check below).
- Supporting optimizations include DisposableTensor for releasing activation memory early, DeepGEMM kernels for efficient MoE computation, and workload-simulation tools for analyzing expert distribution (memory-release sketch below).
- Limitations include high time-to-first-token (TTFT, 2-5 s) and inter-token latency (ITL, ~100 ms), sequence-length constraints, and incomplete integration of multi-token prediction (MTP) with DP attention.
- Future work includes latency optimization, Blackwell architecture support, and flexible tensor parallelism for dense FFNs.
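
To make the PD split concrete, here is a minimal, hypothetical sketch of the two phases; it is a toy, not SGLang's implementation, and the `Request`, `prefill`, and `decode` names plus the placeholder "sampling" are invented for illustration. Prefill does one large, compute-bound pass over the whole prompt and produces the KV cache; decode then runs many small, memory-bound steps that each reread that cache.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: list[int]
    kv_cache: list[int] = field(default_factory=list)   # stand-in for per-layer K/V states
    output: list[int] = field(default_factory=list)


def prefill(req: Request) -> Request:
    """Runs on the prefill fleet: one compute-heavy pass over all prompt tokens."""
    req.kv_cache = list(req.prompt)          # placeholder for the real attention KV cache
    return req                               # the KV cache is then shipped to a decode node


def decode(req: Request, max_new_tokens: int) -> Request:
    """Runs on the decode fleet: one token per step, dominated by KV-cache reads."""
    for step in range(max_new_tokens):
        token = (sum(req.kv_cache) + step) % 50257   # placeholder "sampling"
        req.output.append(token)
        req.kv_cache.append(token)
    return req


req = decode(prefill(Request(prompt=[1, 2, 3])), max_new_tokens=4)
print(req.output)
```

Because the two phases never share a GPU, a long prefill can no longer stall decode steps of other requests, which is the interference PD disaggregation removes.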
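The placement sketch below illustrates the load-balancing idea behind EPLB with a simple greedy longest-processing-time heuristic: assign the heaviest experts first, always to the currently least-loaded GPU. The `place_experts` function and the example loads are hypothetical; the real EPLB also replicates hot experts and balances hierarchically across nodes.

```python
import heapq


def place_experts(expert_load: list[int], num_gpus: int) -> list[list[int]]:
    """Assign each expert id to a GPU, heaviest experts first (greedy LPT)."""
    heap = [(0, gpu) for gpu in range(num_gpus)]        # (current load, gpu id)
    heapq.heapify(heap)
    placement: list[list[int]] = [[] for _ in range(num_gpus)]
    for expert in sorted(range(len(expert_load)), key=lambda e: -expert_load[e]):
        load, gpu = heapq.heappop(heap)                 # least-loaded GPU so far
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + expert_load[expert], gpu))
    return placement


# 16 experts with a skewed token load, spread over 4 GPUs.
loads = [900, 850, 120, 100, 90, 80, 70, 60, 50, 40, 30, 30, 20, 20, 10, 10]
for gpu, experts in enumerate(place_experts(loads, num_gpus=4)):
    print(f"GPU {gpu}: experts {experts}, load {sum(loads[e] for e in experts)}")
```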
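The toy timing model below shows why TBO helps, under the assumption that each layer has a compute phase and an all-to-all communication phase that can run concurrently on different engines. With two half-size micro-batches, one micro-batch's communication hides behind the other's compute. The durations are made-up numbers purely for illustration, not measurements from the deployment.

```python
def run(layers: int, compute_ms: float, comm_ms: float, microbatches: int) -> float:
    """Greedy schedule: one compute engine and one comm engine, usable in parallel."""
    compute_free = comm_free = 0.0
    ready = [0.0] * microbatches          # time each micro-batch finished its last phase
    for _ in range(layers):
        for mb in range(microbatches):
            start = max(ready[mb], compute_free)       # compute phase of this layer
            compute_free = ready[mb] = start + compute_ms
            start = max(ready[mb], comm_free)          # dispatch/combine communication
            comm_free = ready[mb] = start + comm_ms
    return max(ready)


layers, compute_ms, comm_ms = 4, 10.0, 6.0
single = run(layers, compute_ms, comm_ms, microbatches=1)
tbo = run(layers, compute_ms / 2, comm_ms / 2, microbatches=2)
print(f"single batch: {single:.0f} ms, two-batch overlap: {tbo:.0f} ms")
```

In this toy run the overlapped schedule finishes in 43 ms versus 64 ms for the single batch, the same kind of gain TBO is reported to deliver in practice.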
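A quick arithmetic check of the quoted $0.20 per 1M output tokens, assuming roughly $2 per H100 per hour (the GPU price is an assumption, not a figure stated in this summary):

```python
output_tok_per_sec_per_node = 22_300        # decode throughput per 8xH100 node
gpu_hourly_usd, gpus_per_node = 2.0, 8      # assumed rental price per H100

tokens_per_hour = output_tok_per_sec_per_node * 3600      # ~80M output tokens per node-hour
node_hourly_usd = gpu_hourly_usd * gpus_per_node          # $16 per node-hour
cost_per_million = node_hourly_usd / tokens_per_hour * 1e6
print(f"~${cost_per_million:.2f} per 1M output tokens")   # ~$0.20
```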
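Finally, the memory-release sketch below captures the idea behind DisposableTensor: hold the only strong reference to a large activation and drop it explicitly once its last consumer is done, so the allocator can reuse the memory without waiting for outer scopes to go away. This is a minimal stand-in, not SGLang's actual class.

```python
import torch


class DisposableTensor:
    """Hypothetical sketch of early memory release (not SGLang's real implementation)."""

    def __init__(self, tensor: torch.Tensor):
        self._tensor = tensor

    def get(self) -> torch.Tensor:
        if self._tensor is None:
            raise RuntimeError("tensor was already disposed")
        return self._tensor

    def dispose(self) -> None:
        # Dropping the only reference lets the caching allocator reuse the storage now.
        self._tensor = None


hidden = DisposableTensor(torch.randn(1024, 512))
out = hidden.get() @ torch.randn(512, 1024)   # last use of this activation
hidden.dispose()                              # release it before later layers allocate
```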