
Deploying DeepSeek on 96 H100 GPUs

12 days ago
  • #Expert Parallelism
  • #GPU Inference
  • #LLM Optimization
  • DeepSeek is an open-source large language model (LLM) with strong performance, utilizing Multi-head Latent Attention (MLA) and Mixture of Experts (MoE).
  • SGLang optimizes DeepSeek's inference by deploying it across 12 nodes of 8 H100 GPUs each (96 GPUs total), achieving 52.3k input tokens/sec and 22.3k output tokens/sec per node.
  • Prefill-decode (PD) disaggregation separates the compute-intensive prefill phase from the memory-intensive decode phase, improving efficiency and reducing latency (see the toy illustration after this list).
  • Large-scale Expert Parallelism (EP) distributes expert weights across GPUs, addressing memory bottlenecks and workload imbalance with tools like DeepEP and EPLB (see the placement sketch after this list).
  • Two-batch Overlap (TBO) splits each batch into two micro-batches so that one's communication overlaps with the other's computation, improving throughput by up to 35% in the prefill and decode phases (see the stream-overlap sketch after this list).
  • SGLang's implementation achieves near parity with DeepSeek's official benchmarks at roughly $0.20 per 1M output tokens, about one fifth of the DeepSeek Chat API price (see the cost check after this list).
  • Key optimizations include DisposableTensor for memory management, DeepGEMM for efficient MoE computation, and workload-simulation tools for analyzing expert distribution (see the DisposableTensor-style sketch after this list).
  • Limitations include high time-to-first-token (TTFT, 2-5 s) and inter-token latency (ITL, ~100 ms), sequence-length constraints, and incomplete multi-token prediction (MTP) integration with DP attention.
  • Future work includes latency optimization, Blackwell architecture support, and flexible tensor parallelism for dense FFNs.
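
The prefill/decode contrast is easiest to see in a toy example: prefill pushes the whole prompt through one large projection (compute-bound), while each decode step adds a single token but must read back the entire KV cache (memory-bound). The single-head attention, shapes, and function names below are illustrative stand-ins, not DeepSeek's MLA or SGLang's scheduler.

```python
import torch

def prefill(prompt_h: torch.Tensor, wk: torch.Tensor, wv: torch.Tensor):
    # One large batched projection over all prompt tokens at once.
    return prompt_h @ wk, prompt_h @ wv

def decode_step(h: torch.Tensor, k_cache, v_cache, wk, wv):
    # Each step appends one token but attends over the whole (growing) cache.
    k_cache = torch.cat([k_cache, h @ wk], dim=0)
    v_cache = torch.cat([v_cache, h @ wv], dim=0)
    attn = torch.softmax(h @ k_cache.T / k_cache.shape[-1] ** 0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache

if __name__ == "__main__":
    d = 64
    wk, wv = torch.randn(d, d), torch.randn(d, d)
    k, v = prefill(torch.randn(512, d), wk, wv)       # 512-token prompt in one pass
    out, k, v = decode_step(torch.randn(1, d), k, v, wk, wv)
    print(out.shape, k.shape)                         # torch.Size([1, 64]) torch.Size([513, 64])
```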
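The workload-imbalance point behind EPLB can be pictured as a placement problem: given per-expert token counts, spread experts so no GPU is overloaded. The greedy bin-packing below is only a sketch of that idea under assumed loads and names; it is not EPLB's actual algorithm and it ignores replication of hot experts.

```python
import heapq

def balance_experts(expert_loads: dict[int, int], num_gpus: int) -> dict[int, list[int]]:
    """Greedily assign each expert to the currently least-loaded GPU."""
    heap = [(0, gpu) for gpu in range(num_gpus)]   # (accumulated load, gpu id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}

    # Place heavy experts first so they spread across GPUs.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (gpu_load + load, gpu))
    return placement

if __name__ == "__main__":
    # Hypothetical token counts routed to 8 experts, spread over 4 GPUs.
    loads = {0: 9000, 1: 700, 2: 4200, 3: 300, 4: 8800, 5: 1500, 6: 2600, 7: 5100}
    print(balance_experts(loads, num_gpus=4))
```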
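Two-batch overlap reduces to a small scheduling pattern: split a batch in two and let one half's transfer run on a side CUDA stream while the other half computes. In the sketch below the "communication" is a host-to-device copy rather than the MoE all-to-all dispatch/combine, and all names and sizes are assumed for illustration.

```python
import torch

def mlp(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return torch.relu(x @ w)

def run_with_tbo(batch_cpu: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    assert torch.cuda.is_available(), "this sketch needs a CUDA device"
    device = torch.device("cuda")
    w = w.to(device)
    micro_a, micro_b = [t.pin_memory() for t in batch_cpu.chunk(2, dim=0)]

    copy_stream = torch.cuda.Stream()

    # Copy micro-batch A on the default stream, then launch micro-batch B's copy
    # on a side stream so it overlaps with A's computation below.
    a_dev = micro_a.to(device, non_blocking=True)
    with torch.cuda.stream(copy_stream):
        b_dev = micro_b.to(device, non_blocking=True)

    out_a = mlp(a_dev, w)                                 # runs while B is still copying
    torch.cuda.current_stream().wait_stream(copy_stream)  # ensure B has arrived
    out_b = mlp(b_dev, w)
    return torch.cat([out_a, out_b], dim=0)

if __name__ == "__main__":
    x = torch.randn(64, 1024)          # 64 "tokens", hidden size 1024
    w = torch.randn(1024, 4096)
    print(run_with_tbo(x, w).shape)    # torch.Size([64, 4096])
```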
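The $0.20 per 1M output tokens figure can be sanity-checked from the per-node decode throughput, assuming a rental price of about $2 per H100 GPU-hour (the hourly rate is an assumption, not part of the summary).

```python
GPU_HOURLY_COST = 2.0                 # USD per H100 per hour (assumed)
GPUS_PER_NODE = 8
DECODE_TOKENS_PER_SEC_PER_NODE = 22_300

node_cost_per_hour = GPU_HOURLY_COST * GPUS_PER_NODE      # $16/hour per node
tokens_per_hour = DECODE_TOKENS_PER_SEC_PER_NODE * 3600   # ~80.3M output tokens/hour
cost_per_million = node_cost_per_hour / tokens_per_hour * 1e6
print(f"${cost_per_million:.2f} per 1M output tokens")    # ≈ $0.20
```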
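DisposableTensor targets a common pattern: a large activation is still referenced somewhere in Python, so its GPU memory cannot be reclaimed even though it is no longer needed. The wrapper below captures the idea of an explicit dispose() that drops the buffer early; it mirrors the concept only and is not SGLang's actual class or API.

```python
from typing import Optional
import torch

class DisposableTensor:
    """Wraps a tensor so its storage can be released before the wrapper itself dies."""

    def __init__(self, tensor: torch.Tensor) -> None:
        self._tensor: Optional[torch.Tensor] = tensor

    @property
    def tensor(self) -> torch.Tensor:
        if self._tensor is None:
            raise RuntimeError("tensor already disposed")
        return self._tensor

    def dispose(self) -> None:
        # Drop the only strong reference to the storage so the caching
        # allocator can reuse the memory immediately.
        self._tensor = None

if __name__ == "__main__":
    x = DisposableTensor(torch.zeros(1024, 1024))
    y = x.tensor.sum()   # use the buffer while it is alive
    x.dispose()          # free the ~4 MB buffer without waiting for `x` to be garbage-collected
    print(y)
```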