Enabling Trillion-Parameter Models on AWS EFA
- #AWS EFA
- #GPU Optimization
- #Mixture-of-Experts
- Perplexity uses large open-source Mixture-of-Experts (MoE) models like Kimi-K2, which require multi-node deployments due to their size.
- MoE models replace the dense feed-forward layers of a transformer with a set of experts plus a routing layer that sends each token to a small subset of them, which allows the experts to be sharded across GPUs (a minimal routing sketch follows this list).
- Custom kernels for expert parallelism achieve state-of-the-art latencies on ConnectX-7 and viable latencies on AWS EFA, enabling trillion-parameter model deployments.
- MoE routing produces sparse, data-dependent peer-to-peer traffic: a dispatch kernel sends each token to the GPUs hosting its selected experts, and a combine kernel gathers and reduces the expert outputs, both of which must be specialized for low latency (see the dispatch/combine sketch after the list).
- Perplexity developed portable inter-node and specialized intra-node kernels to optimize MoE communication.
- AWS EFA lacks GPUDirect Async support, so network operations must be driven by a CPU proxy thread rather than triggered directly from the GPU, which adds overhead compared to ConnectX-7 (a conceptual proxy-loop sketch follows the list).
- New hybrid CPU-GPU kernels reuse TransferEngine, which already handles KV cache transfers, to drive MoE routing over both EFA and NVLink.
- Dispatch and combine kernels are split into send and receive halves so that communication overlaps with computation (see the overlap sketch below).
- Performance evaluations show competitive latencies on ConnectX-7 and EFA, with improvements over DeepEP and NVSHMEM-based kernels.
- Future work includes collaborating with AWS to enhance EFA performance and experimenting with efa-direct for reduced overhead.
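
A minimal sketch of the expert-plus-router structure described above, in PyTorch. The expert count, hidden sizes, and top-k value are illustrative assumptions rather than Kimi-K2's actual configuration, and the experts live on one device instead of being sharded.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; not Kimi-K2's configuration.
NUM_EXPERTS, TOP_K, HIDDEN, FFN = 8, 2, 1024, 4096

router = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Sequential(
        torch.nn.Linear(HIDDEN, FFN), torch.nn.GELU(), torch.nn.Linear(FFN, HIDDEN)
    )
    for _ in range(NUM_EXPERTS)
)

def moe_layer(x: torch.Tensor) -> torch.Tensor:
    """x: [tokens, HIDDEN]. Each token is processed only by its top-k experts."""
    logits = router(x)                                # [tokens, NUM_EXPERTS]
    weights, expert_ids = logits.topk(TOP_K, dim=-1)  # routing decision per token
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
        if token_idx.numel():
            out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
    return out

print(moe_layer(torch.randn(16, HIDDEN)).shape)  # torch.Size([16, 1024])
```

With expert parallelism, each rank holds only a slice of `experts`, which is what turns the routing decision into network traffic.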
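
The dispatch/combine pattern itself can be shown on a single device in plain PyTorch: dispatch groups (token, expert) pairs so each expert's inputs are contiguous, and combine scatter-adds the weighted expert outputs back to their source tokens. In the real kernels these groups become transfers to the peers that own each expert; the helper names below are illustrative, not Perplexity's API.

```python
import torch

def dispatch(x, expert_ids, weights):
    """Send half of routing: order (token, expert) pairs into per-expert groups.
    x: [tokens, hidden]; expert_ids, weights: [tokens, top_k]."""
    top_k = expert_ids.size(1)
    flat_expert = expert_ids.reshape(-1)
    flat_token = torch.arange(x.size(0)).repeat_interleave(top_k)
    flat_weight = weights.reshape(-1)
    order = torch.argsort(flat_expert)  # contiguous runs per destination expert
    return x[flat_token[order]], flat_token[order], flat_expert[order], flat_weight[order]

def combine(expert_out, token_ids, weights, num_tokens):
    """Return half of routing: weighted scatter-add back to the source tokens."""
    out = torch.zeros(num_tokens, expert_out.size(1), dtype=expert_out.dtype)
    out.index_add_(0, token_ids, weights[:, None] * expert_out)
    return out

x = torch.randn(16, 64)
weights, expert_ids = torch.softmax(torch.randn(16, 4), dim=-1).topk(2, dim=-1)
xd, tok, exp, wd = dispatch(x, expert_ids, weights)
y = combine(xd, tok, wd, num_tokens=16)  # identity "experts" for illustration
```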
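
On EFA the GPU cannot trigger RDMA operations itself, so a host-side proxy has to poll for work the GPU posts and submit it to the NIC. The sketch below is purely conceptual: a Python queue stands in for a host-visible command ring, and `post_rdma_write` is a hypothetical placeholder, not an EFA or TransferEngine API.

```python
import queue
import threading
import time

command_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for a host-visible command ring
shutdown = threading.Event()

def post_rdma_write(cmd: dict) -> None:
    """Hypothetical placeholder for handing an RDMA write to the network provider."""
    print(f"proxy: write {cmd['nbytes']} bytes to peer {cmd['peer']}")

def proxy_loop() -> None:
    """CPU proxy thread: drain GPU-posted descriptors and drive the NIC.
    This host round-trip is the extra overhead relative to ConnectX-7,
    where GPUDirect Async lets the GPU trigger the network operation directly."""
    while not shutdown.is_set():
        try:
            cmd = command_queue.get(timeout=0.001)
        except queue.Empty:
            continue
        post_rdma_write(cmd)

proxy = threading.Thread(target=proxy_loop, daemon=True)
proxy.start()
command_queue.put({"peer": 3, "nbytes": 4096})  # what a send kernel would enqueue
time.sleep(0.05)
shutdown.set()
proxy.join()
```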
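
Finally, the point of splitting each kernel into send and receive halves can be shown in a few lines; every callable here is a hypothetical stand-in for the real kernels.

```python
def moe_dispatch_overlapped(x, route, dispatch_send, dispatch_recv, shared_expert):
    """Split-kernel pattern: the send half posts the token transfers and returns
    immediately, independent work runs while the data is in flight, and only
    the receive half blocks on completion."""
    handle = dispatch_send(x, route)  # send half: enqueue transfers, no waiting
    shared_out = shared_expert(x)     # computation overlapped with communication
    routed = dispatch_recv(handle)    # receive half: wait for tokens to arrive
    return routed, shared_out
```

The same split applies to combine, so the network hop is hidden behind whatever computation the model can run concurrently.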