Enabling Trillion-Parameter Models on AWS EFA
- #AWS EFA
- #GPU Optimization
- #Mixture-of-Experts
- Perplexity uses large open-source Mixture-of-Experts (MoE) models like Kimi-K2, which require multi-node deployments due to their size.
- MoE models replace the dense feed-forward layers of a transformer with a set of experts plus a routing layer that sends each token to a small subset of them, which allows the experts to be sharded across GPUs (a minimal routing sketch follows this list).
- Custom kernels for expert parallelism achieve state-of-the-art latencies on ConnectX-7 and viable latencies on AWS EFA, enabling trillion-parameter model deployments.
- MoE routing produces sparse, data-dependent peer-to-peer traffic: a dispatch kernel sends each token to the GPUs hosting its selected experts, and a combine kernel gathers and reduces the expert outputs, both of which must be specialized for low latency (see the dispatch/combine sketch after the list).
- Perplexity developed portable inter-node and specialized intra-node kernels to optimize MoE communication.
- AWS EFA lacks GPUDirect Async support, so network operations must be driven by a CPU proxy thread rather than triggered directly from the GPU, which adds overhead compared to ConnectX-7 (a conceptual proxy-loop sketch follows the list).
- New hybrid CPU-GPU kernels reuse TransferEngine, which already handles KV cache transfers, to drive MoE routing over both EFA and NVLink.
- Dispatch and combine kernels are split into send and receive halves so that communication overlaps with computation (see the overlap sketch below).
- Performance evaluations show competitive latencies on ConnectX-7 and EFA, with improvements over DeepEP and NVSHMEM-based kernels.
- Future work includes collaborating with AWS to enhance EFA performance and experimenting with efa-direct for reduced overhead.
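
A minimal sketch of the expert-plus-router structure described above, in PyTorch. The expert count, hidden sizes, and top-k value are illustrative assumptions rather than Kimi-K2's actual configuration, and the experts live on one device instead of being sharded.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; not Kimi-K2's configuration.
NUM_EXPERTS, TOP_K, HIDDEN, FFN = 8, 2, 1024, 4096

router = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Sequential(
        torch.nn.Linear(HIDDEN, FFN), torch.nn.GELU(), torch.nn.Linear(FFN, HIDDEN)
    )
    for _ in range(NUM_EXPERTS)
)

def moe_layer(x: torch.Tensor) -> torch.Tensor:
    """x: [tokens, HIDDEN]. Each token is processed only by its top-k experts."""
    logits = router(x)                                # [tokens, NUM_EXPERTS]
    weights, expert_ids = logits.topk(TOP_K, dim=-1)  # routing decision per token
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
        if token_idx.numel():
            out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
    return out

print(moe_layer(torch.randn(16, HIDDEN)).shape)  # torch.Size([16, 1024])
```

With expert parallelism, each rank holds only a slice of `experts`, which is what turns the routing decision into network traffic.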
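
The dispatch/combine pattern itself can be shown on a single device in plain PyTorch: dispatch groups (token, expert) pairs so each expert's inputs are contiguous, and combine scatter-adds the weighted expert outputs back to their source tokens. In the real kernels these groups become transfers to the peers that own each expert; the helper names below are illustrative, not Perplexity's API.

```python
import torch

def dispatch(x, expert_ids, weights):
    """Send half of routing: order (token, expert) pairs into per-expert groups.
    x: [tokens, hidden]; expert_ids, weights: [tokens, top_k]."""
    top_k = expert_ids.size(1)
    flat_expert = expert_ids.reshape(-1)
    flat_token = torch.arange(x.size(0)).repeat_interleave(top_k)
    flat_weight = weights.reshape(-1)
    order = torch.argsort(flat_expert)  # contiguous runs per destination expert
    return x[flat_token[order]], flat_token[order], flat_expert[order], flat_weight[order]

def combine(expert_out, token_ids, weights, num_tokens):
    """Return half of routing: weighted scatter-add back to the source tokens."""
    out = torch.zeros(num_tokens, expert_out.size(1), dtype=expert_out.dtype)
    out.index_add_(0, token_ids, weights[:, None] * expert_out)
    return out

x = torch.randn(16, 64)
weights, expert_ids = torch.softmax(torch.randn(16, 4), dim=-1).topk(2, dim=-1)
xd, tok, exp, wd = dispatch(x, expert_ids, weights)
y = combine(xd, tok, wd, num_tokens=16)  # identity "experts" for illustration
```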
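
On EFA the GPU cannot trigger RDMA operations itself, so a host-side proxy has to poll for work the GPU posts and submit it to the NIC. The sketch below is purely conceptual: a Python queue stands in for a host-visible command ring, and `post_rdma_write` is a hypothetical placeholder, not an EFA or TransferEngine API.

```python
import queue
import threading
import time

command_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for a host-visible command ring
shutdown = threading.Event()

def post_rdma_write(cmd: dict) -> None:
    """Hypothetical placeholder for handing an RDMA write to the network provider."""
    print(f"proxy: write {cmd['nbytes']} bytes to peer {cmd['peer']}")

def proxy_loop() -> None:
    """CPU proxy thread: drain GPU-posted descriptors and drive the NIC.
    This host round-trip is the extra overhead relative to ConnectX-7,
    where GPUDirect Async lets the GPU trigger the network operation directly."""
    while not shutdown.is_set():
        try:
            cmd = command_queue.get(timeout=0.001)
        except queue.Empty:
            continue
        post_rdma_write(cmd)

proxy = threading.Thread(target=proxy_loop, daemon=True)
proxy.start()
command_queue.put({"peer": 3, "nbytes": 4096})  # what a send kernel would enqueue
time.sleep(0.05)
shutdown.set()
proxy.join()
```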
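
Finally, the point of splitting each kernel into send and receive halves can be shown in a few lines; every callable here is a hypothetical stand-in for the real kernels.

```python
def moe_dispatch_overlapped(x, route, dispatch_send, dispatch_recv, shared_expert):
    """Split-kernel pattern: the send half posts the token transfers and returns
    immediately, independent work runs while the data is in flight, and only
    the receive half blocks on completion."""
    handle = dispatch_send(x, route)  # send half: enqueue transfers, no waiting
    shared_out = shared_expert(x)     # computation overlapped with communication
    routed = dispatch_recv(handle)    # receive half: wait for tokens to arrive
    return routed, shared_out
```

The same split applies to combine, so the network hop is hidden behind whatever computation the model can run concurrently.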