Anatomy of a high-performance EP kernel

6 hours ago
  • #Expert Parallelism
  • #MoE Inference
  • #GPU Communication
  • Large language models require multiple GPUs for inference, and the parallelism technique chosen determines how those GPUs communicate.
  • Expert Parallelism (EP) is essential for Mixture of Experts (MoE) models at large scale, as it handles dynamic routing where tokens are assigned to experts at runtime.
  • The DeepEP library sets the standard for high-performance EP kernels and offers two modes: high throughput (with a coordination pass) and low latency (without one).
  • High-throughput dispatch starts with a coordination pass in which ranks exchange token counts, so that receivers can allocate compact buffers; the tokens are then transferred through queues and locally permuted into per-expert groups for the grouped GEMM.
  • Combine reverses the dispatch process, using routing information to return and sum expert outputs, applying gate weights to form the final token outputs.
  • Low-latency dispatch avoids the coordination pass by using pre-reserved fixed buffers per (source rank, expert) pair, with quantized FP8 payloads to reduce transfer size, but may waste memory due to padding.
  • Low-latency combine writes expert outputs directly to precomputed slots on the home rank and uses flags to signal completion, enabling fast weighted summation.
  • Recent extensions of EP include the Expert Parallelism Load Balancer (EPLB), elastic EP in vLLM, and fusing communication into compute kernels for better overlap and pipelining.
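
The high-throughput dispatch and combine steps summarized above can be illustrated with a minimal single-process sketch. This is not DeepEP's API: the function names (`dispatch`, `combine`), the data layout, and the toy expert are all hypothetical, Python lists stand in for GPU buffers, and every (token, expert) pair is sent as its own copy for simplicity.

```python
def dispatch(tokens_per_rank, routes_per_rank, num_ranks, experts_per_rank):
    """Simulate high-throughput dispatch (hypothetical sketch, not DeepEP).

    tokens_per_rank[r][t] is the vector of token t on rank r.
    routes_per_rank[r][t] is a list of (expert_id, gate_weight) pairs.
    Expert e is assumed to live on rank e // experts_per_rank.
    """
    # Coordination pass: counts[src][dst] = copies that src will send to dst.
    counts = [[0] * num_ranks for _ in range(num_ranks)]
    for src, routes in enumerate(routes_per_rank):
        for token_routes in routes:
            for expert, _gate in token_routes:
                counts[src][expert // experts_per_rank] += 1

    # Each receiver now knows exactly how large its buffer must be.
    recv_sizes = [sum(counts[src][dst] for src in range(num_ranks))
                  for dst in range(num_ranks)]

    # Data pass: ship each copy together with its routing metadata.
    recv = [[] for _ in range(num_ranks)]
    for src, routes in enumerate(routes_per_rank):
        for t, token_routes in enumerate(routes):
            for expert, gate in token_routes:
                dst = expert // experts_per_rank
                recv[dst].append((expert, src, t, gate, tokens_per_rank[src][t]))

    # Local permute: group entries by expert, as a grouped GEMM would expect.
    for buf in recv:
        buf.sort(key=lambda entry: entry[0])
    return recv_sizes, recv


def combine(recv, expert_fn, tokens_per_rank):
    """Return expert outputs to their home rank and sum them with gate weights."""
    out = [[[0.0] * len(tok) for tok in toks] for toks in tokens_per_rank]
    for buf in recv:
        for expert, src, t, gate, x in buf:
            y = expert_fn(expert, x)
            acc = out[src][t]
            for i, value in enumerate(y):
                acc[i] += gate * value
    return out


def toy_expert(expert, x):
    # Stand-in for an expert MLP: scale the input by (expert + 1).
    return [(expert + 1) * v for v in x]


# Two ranks, two experts per rank (experts 0-1 on rank 0, experts 2-3 on rank 1).
tokens = [[[1.0, 2.0]], [[4.0, 0.0]]]
routes = [[[(0, 0.5), (3, 0.5)]], [[(2, 1.0)]]]
sizes, buffers = dispatch(tokens, routes, num_ranks=2, experts_per_rank=2)
outputs = combine(buffers, toy_expert, tokens)
```

In this sketch the count exchange is what lets each receive buffer be sized exactly (`sizes`), which is the property the low-latency mode gives up in exchange for skipping that pass.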