Anatomy of a high-performance EP kernel
6 hours ago
- #Expert Parallelism
- #MoE Inference
- #GPU Communication
- Large language models often need multiple GPUs for inference, so the GPUs must communicate; what they exchange, and when, depends on the parallelism technique used to split the model.
- Expert Parallelism (EP) is essential for Mixture of Experts (MoE) models at large scale, as it handles dynamic routing where tokens are assigned to experts at runtime.
- The DeepEP library sets the standard for high-performance EP kernels and offers two kernel modes: high-throughput (with a coordination pass) and low-latency (without one).
- High-throughput dispatch involves a coordination pass to exchange token counts, enabling allocation of compact buffers and efficient data transfer via queues, followed by a local permute for grouped GEMM.
- Combine reverses the dispatch process, using routing information to return and sum expert outputs, applying gate weights to form the final token outputs.
- Low-latency dispatch skips the coordination pass: each (source rank, expert) pair gets a pre-reserved, fixed-size buffer, and payloads are quantized to FP8 to reduce transfer size. The trade-off is memory wasted on padding whenever a buffer is not full.
- In low-latency mode, combine writes expert outputs directly into precomputed slots on each token's home rank and uses flags to signal completion, so the home rank can start the weighted sum as soon as the flags are set.
- Recent extensions of EP include the Expert Parallelism Load Balancer (EPLB), elastic EP in vLLM, and fusing communication with compute kernels for better overlap and pipelining.
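
The dispatch and combine steps summarized above can be sketched as a single-process simulation. This is a toy model in plain Python, not DeepEP's API: the function names (`route_counts`, `dispatch`, `combine`, `fixed_buffer_slots`), the contiguous expert-to-rank mapping, the tuple layout of a buffer entry, and the per-pair capacity used for the low-latency estimate are all assumptions made for illustration. Real kernels move data between GPUs; here "ranks" are just list indices.

```python
def route_counts(routes, num_ranks, experts_per_rank):
    """Coordination pass: counts[src][dst] = token copies rank src sends to rank dst."""
    counts = [[0] * num_ranks for _ in range(num_ranks)]
    for src, rank_routes in enumerate(routes):
        for token_routes in rank_routes:
            for expert, _gate in token_routes:
                # Assumption: experts are placed contiguously, experts_per_rank per rank.
                counts[src][expert // experts_per_rank] += 1
    return counts


def dispatch(tokens, routes, num_ranks, experts_per_rank):
    """High-throughput style dispatch: exact-size buffers, then a local permute."""
    counts = route_counts(routes, num_ranks, experts_per_rank)
    # Compact receive buffers, sized exactly from the exchanged counts.
    recv = [
        [None] * sum(counts[src][dst] for src in range(num_ranks))
        for dst in range(num_ranks)
    ]
    # Each source owns a contiguous region of every destination buffer
    # (exclusive prefix sum over the counts of lower-numbered sources).
    cursor = [
        [sum(counts[s][dst] for s in range(src)) for dst in range(num_ranks)]
        for src in range(num_ranks)
    ]
    for src, rank_routes in enumerate(routes):
        for t, token_routes in enumerate(rank_routes):
            for expert, gate in token_routes:
                dst = expert // experts_per_rank
                # Keep (src, t, gate) as routing info so combine can send results home.
                recv[dst][cursor[src][dst]] = (expert, src, t, gate, tokens[src][t])
                cursor[src][dst] += 1
    # Local permute: group entries by expert, the layout a grouped GEMM expects.
    return [sorted(buf, key=lambda entry: entry[0]) for buf in recv]


def combine(recv, tokens, expert_fn):
    """Reverse of dispatch: return expert outputs home and sum them with gate weights."""
    out = [[[0.0] * len(vec) for vec in rank_tokens] for rank_tokens in tokens]
    for buf in recv:
        for expert, src, t, gate, vec in buf:
            for i, value in enumerate(expert_fn(expert, vec)):
                out[src][t][i] += gate * value
    return out


def fixed_buffer_slots(num_ranks, num_experts, max_tokens_per_rank):
    """Slots reserved without a coordination pass.

    Assumption: every (source rank, expert) pair reserves room for all tokens
    the source rank could possibly send to that expert.
    """
    return num_ranks * num_experts * max_tokens_per_rank


# Toy setup: 2 ranks, 2 experts per rank, 2-dimensional tokens.
tokens = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
# routes[rank][token] = list of (expert id, gate weight)
routes = [[[(0, 0.5), (3, 0.5)], [(2, 1.0)]], [[(1, 0.25), (2, 0.75)]]]


def toy_expert(expert, vec):
    # Stand-in for an expert MLP: expert e scales its input by (e + 1).
    return [(expert + 1) * v for v in vec]


recv = dispatch(tokens, routes, num_ranks=2, experts_per_rank=2)
outputs = combine(recv, tokens, toy_expert)
compact_slots = sum(len(buf) for buf in recv)
reserved_slots = fixed_buffer_slots(num_ranks=2, num_experts=4, max_tokens_per_rank=2)
```

In this toy run the compact buffers hold only the token copies that were actually routed, while the fixed-buffer estimate reserves a slot for every possible (source rank, expert, token) combination. That gap is the padding cost the low-latency path accepts in exchange for skipping the count exchange.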