Anatomy of a high-performance EP kernel

Large language models are often too big to serve from a single GPU, so inference is split across several GPUs that must communicate; the communication pattern depends on the parallelism technique used.
Expert Parallelism (EP) is essential for Mixture of Experts (MoE) models at large scale: experts are spread across GPUs, and because each token is assigned to its experts at runtime, the routing is dynamic.
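To make "assigned at runtime" concrete, here is a minimal sketch of top-k gating in plain Python. The function name `route_top_k`, the softmax over only the selected scores, and the toy score matrix are illustrative assumptions, not details taken from the article or from any particular model.

```python
import math

def route_top_k(scores, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_ids, gate_weights). The weights are a softmax over the
    selected scores only, which is an assumed normalization; models differ.
    """
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    m = max(scores[e] for e in top)                 # subtract max for stability
    exps = [math.exp(scores[e] - m) for e in top]
    total = sum(exps)
    return top, [x / total for x in exps]

# Toy router scores: 3 tokens x 4 experts (made-up numbers).
router_scores = [
    [0.1, 2.0, 0.3, 1.5],
    [1.2, 0.0, 0.9, 0.1],
    [0.2, 0.1, 3.0, 2.5],
]
routing = [route_top_k(row, k=2) for row in router_scores]
for t, (experts, weights) in enumerate(routing):
    print(f"token {t} -> experts {experts}, weights {[round(w, 3) for w in weights]}")
```

Because the chosen experts depend on the scores, the set of GPUs each token must reach is only known once the router has run, which is what makes EP communication harder to plan than a fixed all-reduce.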
The DeepEP library sets the standard for high-performance EP kernels and provides two kernel modes: high-throughput (with a coordination pass) and low-latency (without one).
High-throughput dispatch starts with a coordination pass in which ranks exchange token counts. Once the counts are known, receivers can allocate compact buffers, the data is transferred efficiently through queues, and a local permute then groups tokens by expert for the grouped GEMM.
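The two-pass structure can be simulated in a single process. The sketch below is an illustration under stated assumptions: ranks are plain Python lists, the count "exchange" is a matrix column read, and payloads are written straight to their offsets instead of being streamed through queues as in a real kernel. All names (`tokens`, `send_counts`, `owner`) are hypothetical.

```python
from itertools import accumulate

NUM_RANKS = 2
EXPERTS_PER_RANK = 2  # assumed layout: expert e lives on rank e // EXPERTS_PER_RANK

# tokens[src] = list of (token_id, expert_id) routing entries (toy data).
tokens = [
    [("a0", 0), ("a1", 3), ("a2", 2), ("a3", 3)],   # held by rank 0
    [("b0", 1), ("b1", 2), ("b2", 0)],              # held by rank 1
]

def owner(expert):
    return expert // EXPERTS_PER_RANK

# Coordination pass: each source counts what it will send to each destination.
send_counts = [[0] * NUM_RANKS for _ in range(NUM_RANKS)]
for src, entries in enumerate(tokens):
    for _, expert in entries:
        send_counts[src][owner(expert)] += 1

# "Exchange" the counts: destination dst learns column dst of the matrix,
# then reserves one contiguous slice per source via an exclusive prefix sum.
recv_buffers, recv_offsets = [], []
for dst in range(NUM_RANKS):
    counts = [send_counts[src][dst] for src in range(NUM_RANKS)]
    offsets = [0] + list(accumulate(counts))[:-1]
    recv_offsets.append(offsets)
    recv_buffers.append([None] * sum(counts))       # compact: no padding

# Data pass: each source writes into its reserved slice on the destination.
cursor = [list(offs) for offs in recv_offsets]
for src, entries in enumerate(tokens):
    for token_id, expert in entries:
        dst = owner(expert)
        recv_buffers[dst][cursor[dst][src]] = (token_id, expert)
        cursor[dst][src] += 1

# Local permute: group rows by expert so each expert's tokens are contiguous,
# the layout a grouped GEMM expects. sorted() is stable, so arrival order
# within an expert is preserved.
grouped = [sorted(buf, key=lambda row: row[1]) for buf in recv_buffers]
print("send_counts:", send_counts)
print("grouped:", grouped)
```

The point of the extra pass is visible in `recv_buffers`: every buffer is exactly as long as the number of tokens that arrive, at the price of one additional round of communication before any payload moves.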
Combine reverses dispatch: it uses the routing information to return each expert output to the rank that owns the token, where the outputs are summed with their gate weights to form the final token outputs.
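As a rough illustration of the arithmetic in combine, the sketch below computes the gate-weighted sum for each token. The dictionary layout and the name `combine` are hypothetical, and applying the weights on the receiving side is an assumption made for clarity; an implementation may scale the outputs elsewhere.

```python
def combine(routing, expert_outputs, hidden):
    """Gate-weighted sum of expert outputs per token.

    routing:        token_id -> list of (expert_id, gate_weight)
    expert_outputs: (token_id, expert_id) -> output vector of length `hidden`
    """
    out = {}
    for token_id, picks in routing.items():
        acc = [0.0] * hidden
        for expert, weight in picks:
            y = expert_outputs[(token_id, expert)]
            for i in range(hidden):
                acc[i] += weight * y[i]
        out[token_id] = acc
    return out

# Toy data: token t0 used experts 0 and 2, token t1 used expert 1 only.
routing = {"t0": [(0, 0.75), (2, 0.25)], "t1": [(1, 1.0)]}
expert_outputs = {
    ("t0", 0): [4.0, 8.0],
    ("t0", 2): [0.0, 4.0],
    ("t1", 1): [1.0, 2.0],
}
final = combine(routing, expert_outputs, hidden=2)
print(final)  # {'t0': [3.0, 7.0], 't1': [1.0, 2.0]}
```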
Low-latency dispatch skips the coordination pass by pre-reserving a fixed-size buffer for every (source rank, expert) pair, and it quantizes payloads to FP8 to reduce transfer size. The trade-off is memory: because the buffers are fixed-size, any space that is not filled is wasted as padding.
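A back-of-the-envelope calculation shows where the padding comes from. This sketch assumes every (source rank, expert) buffer is sized for the worst case in which a source sends all of its tokens to one expert, and it ignores the per-block scale factors that accompany FP8 data. The parameter names and the example configuration are hypothetical.

```python
def fixed_buffer_bytes(num_ranks, local_experts, max_tokens_per_rank,
                       hidden, bytes_per_elem):
    """Receive memory reserved up front on one rank (assumed sizing rule)."""
    slots = num_ranks * local_experts * max_tokens_per_rank
    return slots * hidden * bytes_per_elem

def compact_bytes(tokens_received, hidden, bytes_per_elem):
    """Memory actually occupied by the tokens that arrived."""
    return tokens_received * hidden * bytes_per_elem

# Made-up configuration for illustration only.
cfg = dict(num_ranks=8, local_experts=4, max_tokens_per_rank=128, hidden=4096)

reserved_fp8 = fixed_buffer_bytes(**cfg, bytes_per_elem=1)   # FP8: 1 byte/elem
reserved_bf16 = fixed_buffer_bytes(**cfg, bytes_per_elem=2)  # BF16: 2 bytes/elem
used_fp8 = compact_bytes(tokens_received=1024, hidden=4096, bytes_per_elem=1)

MIB = 1024 * 1024
print(f"reserved (FP8):  {reserved_fp8 / MIB:.0f} MiB")
print(f"reserved (BF16): {reserved_bf16 / MIB:.0f} MiB")
print(f"used (FP8):      {used_fp8 / MIB:.0f} MiB")
print(f"utilization:     {used_fp8 / reserved_fp8:.0%}")
```

In this toy configuration only a quarter of the reserved space holds real tokens. The reservation is what removes the count exchange from the critical path, so the waste buys latency.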
Low-latency combine writes expert outputs directly into precomputed slots on the home rank and uses flags to signal completion, so the weighted sum can run as soon as the data has arrived.
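The slot-and-flag idea can be mimicked in a single process. In the sketch below, `HomeRankBuffer` and its methods are hypothetical names, and one flag per (token, k) slot is an assumed granularity; a real kernel may signal at a coarser level and needs memory fences to order the payload write before the flag write.

```python
TOP_K = 2
HIDDEN = 2

class HomeRankBuffer:
    """Receive-side buffer: one slot and one ready flag per (token, k) pair."""

    def __init__(self, num_tokens):
        self.slots = [[None] * TOP_K for _ in range(num_tokens)]
        self.ready = [[False] * TOP_K for _ in range(num_tokens)]

    def remote_write(self, token, k, vector):
        # Payload first, flag second: a reader must never observe a set flag
        # with stale data behind it.
        self.slots[token][k] = vector
        self.ready[token][k] = True

    def try_combine(self, token, weights):
        if not all(self.ready[token]):
            return None  # at least one expert output has not landed yet
        return [
            sum(w * self.slots[token][k][i] for k, w in enumerate(weights))
            for i in range(HIDDEN)
        ]

buf = HomeRankBuffer(num_tokens=1)
buf.remote_write(token=0, k=1, vector=[2.0, 4.0])
early = buf.try_combine(0, weights=[0.5, 0.5])   # only one of two slots filled
buf.remote_write(token=0, k=0, vector=[6.0, 0.0])
done = buf.try_combine(0, weights=[0.5, 0.5])
print(early, done)  # None [4.0, 2.0]
```

Since every slot address is known before any data moves, the writers need no handshake with the receiver; the flags are the only synchronization.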
Recent extensions of EP include the Expert Parallelism Load Balancer (EPLB), elastic EP in vLLM, and fusing communication with compute kernels for better overlap and pipelining.
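To give a feel for the load-balancing problem, here is a generic greedy heuristic (heaviest expert first, onto the currently lightest rank). It is not the EPLB algorithm: it does not replicate hot experts and does not force every rank to hold the same number of experts. The function name `place_experts` and the load figures are made up for illustration.

```python
import heapq

def place_experts(expert_load, num_ranks):
    """Greedy placement: heaviest expert first, onto the lightest rank so far."""
    heap = [(0, r) for r in range(num_ranks)]       # (accumulated load, rank)
    heapq.heapify(heap)
    placement = {r: [] for r in range(num_ranks)}
    for expert in sorted(expert_load, key=expert_load.get, reverse=True):
        load, rank = heapq.heappop(heap)
        placement[rank].append(expert)
        heapq.heappush(heap, (load + expert_load[expert], rank))
    rank_load = {r: sum(expert_load[e] for e in es) for r, es in placement.items()}
    return placement, rank_load

# Observed tokens per expert over some window (toy numbers).
observed = {0: 900, 1: 100, 2: 400, 3: 300, 4: 200, 5: 500}
placement, rank_load = place_experts(observed, num_ranks=2)
print(placement)
print(rank_load)
```

With these numbers both ranks end up with the same total load even though one of them hosts twice as many experts, which is why practical balancers also consider replication and per-rank expert limits.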
