GitHub - deepseek-ai/DeepEP: DeepEP: an efficient expert-parallel communication library
- #High-Performance Computing
- #Mixture-of-Experts
- #GPU Communication
- DeepEP is a communication library for Mixture-of-Experts (MoE) and expert parallelism, offering high-throughput, low-latency all-to-all GPU kernels (dispatch and combine) with low-precision support like FP8.
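To make the dispatch/combine terminology concrete, here is a minimal CPU sketch of MoE expert-parallel semantics in plain Python (hypothetical helper names, not DeepEP's API): dispatch routes each token to its top-k experts, each expert processes its tokens, and combine sums the outputs back per token weighted by the gating scores. In DeepEP these two phases are the all-to-all "send" and "receive" kernels.

```python
def run_moe(tokens, topk_ids, topk_weights, experts):
    """Sketch of MoE routing: dispatch -> expert compute -> combine.

    tokens: list of scalar token values (stand-in for hidden states)
    topk_ids / topk_weights: per-token expert indices and gating weights
    experts: list of callables, one per expert
    """
    num_experts = len(experts)
    # dispatch: group token indices by destination expert (the all-to-all send)
    routed = {e: [] for e in range(num_experts)}
    for tok_idx, ids in enumerate(topk_ids):
        for e in ids:
            routed[e].append(tok_idx)
    # each expert processes only the tokens routed to it
    expert_out = {e: {t: experts[e](tokens[t]) for t in toks}
                  for e, toks in routed.items()}
    # combine: weighted sum of expert outputs per token (the all-to-all receive)
    out = []
    for tok_idx, (ids, ws) in enumerate(zip(topk_ids, topk_weights)):
        out.append(sum(w * expert_out[e][tok_idx] for e, w in zip(ids, ws)))
    return out

# two tokens, top-2 routing over three toy experts
result = run_moe(
    tokens=[1.0, 2.0],
    topk_ids=[[0, 1], [1, 2]],
    topk_weights=[[0.5, 0.5], [0.25, 0.75]],
    experts=[lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x],
)  # -> [2.0, 4.0]
```

The real kernels move hidden-state tensors across GPUs (NVLink intranode, RDMA internode) instead of scalars, but the routing contract is the same.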
- The library includes kernels optimized for asymmetric-domain bandwidth forwarding for group-limited gating (e.g., NVLink to RDMA) and low-latency kernels with pure RDMA for inference decoding, along with an SM-free communication-computation overlapping method.
- Performance tests on H800 GPUs show intranode bandwidth up to ~158 GB/s over NVLink and internode bandwidth up to ~58 GB/s over RDMA for the normal kernels, with the low-latency kernels reaching latencies as low as 77 µs.
- Recent optimizations (2025) contributed by Tencent and others improved performance by up to 30%, improved NVLink utilization, and added features such as multi-QP support and AMD GPU compatibility via backends like MORI.
- System requirements include Ampere/Hopper GPUs, Python 3.8+, specific CUDA versions, PyTorch 2.1+, NVLink, RDMA network, and NVSHMEM dependency.
- Usage examples cover normal kernels for training/inference prefilling (with SM control and buffer management) and low-latency kernels for inference decoding (with hook-based overlapping).
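The hook-based overlapping pattern can be sketched in plain Python (hypothetical names, not DeepEP's actual API): the communication call returns immediately with a hook, the caller runs unrelated compute while the transfer is in flight, and invoking the hook blocks only at the point the received data is needed. In DeepEP the transfer is pure RDMA and occupies no SMs; a thread stands in for it here.

```python
import threading

def start_transfer_sketch(payload):
    """Kick off a background 'transfer' and return a hook that resolves it.

    A sketch of hook-based communication-computation overlap: the stand-in
    transfer runs concurrently with the caller's compute, and hook() waits
    only when the result is actually required.
    """
    box = {}
    done = threading.Event()

    def transfer():
        box["recv"] = [x * 2 for x in payload]  # stand-in for the RDMA receive
        done.set()

    threading.Thread(target=transfer, daemon=True).start()

    def hook():
        done.wait()  # block here, not at issue time
        return box["recv"]

    return hook

hook = start_transfer_sketch([1, 2, 3])
overlapped = sum(range(100))  # unrelated compute overlapped with the transfer
received = hook()             # -> [2, 4, 6]
```

The design point is that issuing the communication and consuming its result are decoupled, so the gap between them can be filled with useful work.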
- Advanced features include traffic isolation via InfiniBand Virtual Lanes, recommendations around adaptive routing, and performance optimizations that rely on out-of-spec (undefined-behavior) PTX instruction usage, which can be disabled.
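As a reference point for the traffic-isolation feature: NVSHMEM exposes an InfiniBand service-level environment variable that maps its traffic onto a virtual lane. The variable name below comes from NVSHMEM's environment-variable reference; whether DeepEP deployments use exactly this knob should be confirmed against the repo's docs, so treat it as an assumption.

```shell
# Assign NVSHMEM's InfiniBand traffic a service level (SL); the SL-to-VL
# mapping itself is configured on the subnet manager, so different workloads
# (e.g. normal vs. low-latency kernels) can be isolated on separate lanes.
export NVSHMEM_IB_SL=1
```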
- The library is MIT-licensed (except for code referencing NVSHMEM), and the authors encourage citation; community contributions include zero-copy and eager protocols, hybrid-EP, and diagnostic tools.