
Enabling Trillion-Parameter Models on AWS EFA

18 days ago
  • #AWS EFA
  • #GPU Optimization
  • #Mixture-of-Experts
  • Perplexity uses large open-source Mixture-of-Experts (MoE) models like Kimi-K2, which require multi-node deployments due to their size.
  • MoE models replace dense feed-forward layers with a pool of experts and a routing layer that sends each token to only a few of them, so the experts can be spread across GPUs and run in parallel (see the routing sketch after this list).
  • Custom kernels for expert parallelism achieve state-of-the-art latencies on ConnectX-7 and viable latencies on AWS EFA, enabling trillion-parameter model deployments.
  • MoE routing involves sparse peer-to-peer communication, requiring specialized dispatch and combine kernels for low latency.
  • Perplexity developed portable inter-node and specialized intra-node kernels to optimize MoE communication.
  • AWS EFA lacks GPUDirect Async support, so a CPU proxy thread must post network operations on the GPU's behalf, adding overhead compared to ConnectX-7 (see the proxy sketch after this list).
  • New hybrid CPU-GPU kernels build on TransferEngine, previously used for KV-cache transfers, to handle MoE routing over both EFA and NVLink.
  • Dispatch and combine kernels are split into sender and receiver halves so that communication overlaps with computation (see the send/receive sketch after this list).
  • Performance evaluations show competitive latencies on ConnectX-7 and EFA, with improvements over DeepEP and NVSHMEM-based kernels.
  • Future work includes collaborating with AWS to enhance EFA performance and experimenting with efa-direct for reduced overhead.
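
To make the routing step concrete, here is a minimal sketch of top-k expert routing in PyTorch. The expert count, hidden size, top-k value, and tensor names are illustrative assumptions, not Perplexity's configuration; the point is that each token selects only a handful of experts, so the resulting token-to-expert traffic is sparse and data dependent rather than a dense all-to-all.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden: torch.Tensor, router_weight: torch.Tensor, top_k: int = 8):
    """Top-k MoE routing sketch (illustrative shapes, not production code).

    hidden:        [num_tokens, hidden_dim] activations entering the MoE layer
    router_weight: [hidden_dim, num_experts] learned routing matrix
    Returns, per token, the chosen expert ids and their normalized weights.
    """
    # Router logits and probabilities over all experts.
    logits = hidden @ router_weight                    # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)

    # Each token keeps only its top-k experts; the rest are never computed.
    topk_probs, topk_ids = probs.topk(top_k, dim=-1)   # [num_tokens, top_k]
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    # Because experts live on different GPUs (expert parallelism), topk_ids
    # determines which remote ranks each token must be dispatched to,
    # a sparse, data-dependent peer-to-peer pattern.
    return topk_ids, topk_probs

# Example with made-up sizes: 16 tokens, hidden size 128, 64 experts, top-8 routing.
tokens = torch.randn(16, 128)
router = torch.randn(128, 64)
expert_ids, expert_weights = route_tokens(tokens, router)
```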
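The sender/receiver split is easiest to see as control flow. The sketch below uses hypothetical dispatch_send / dispatch_recv / combine_send / combine_recv stubs (not the actual kernel API) to show the intended pattern: the send half starts communication, independent work such as a shared expert runs in between, and the receive half blocks only when the routed tokens are actually needed.

```python
import torch

# Placeholder stand-ins for real dispatch/combine kernels; each half is a
# separate call so independent compute can run between the two halves.
def dispatch_send(tokens, expert_ids):
    # Send half: start shipping tokens toward their target expert ranks.
    return {"tokens": tokens, "expert_ids": expert_ids}   # opaque handle (stub)

def dispatch_recv(handle):
    # Receive half: wait until all tokens routed to the local experts arrived.
    return handle["tokens"]                               # stub: pretend data landed

def combine_send(expert_out, expert_ids):
    return {"out": expert_out, "expert_ids": expert_ids}  # stub handle

def combine_recv(handle):
    return handle["out"]                                  # stub: reduced outputs

def moe_layer(tokens, expert_ids, shared_expert, local_experts):
    # 1) Kick off dispatch; the transfer proceeds in the background.
    h = dispatch_send(tokens, expert_ids)

    # 2) Overlap: run work that does not depend on remote tokens,
    #    e.g. a shared expert applied to the local tokens.
    shared_out = shared_expert(tokens)

    # 3) Block only when the routed tokens are needed.
    routed = dispatch_recv(h)
    expert_out = local_experts(routed)

    # 4) Same pattern on the way back: start the combine, overlap, then wait.
    c = combine_send(expert_out, expert_ids)
    return shared_out + combine_recv(c)

# Toy usage with identity "experts" just to exercise the control flow.
tokens = torch.randn(16, 128)
expert_ids = torch.randint(0, 64, (16, 8))
out = moe_layer(tokens, expert_ids, torch.nn.Identity(), torch.nn.Identity())
```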
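Because EFA lacks GPUDirect Async, the GPU cannot trigger network operations itself; a host thread has to notice pending work and post it. The pure-Python sketch below is only an analogy for that proxy pattern (the real path involves CUDA kernels signalling through host-visible memory and network posts via the EFA provider): the "GPU side" publishes send descriptors and a proxy thread forwards them, which is the extra hop that ConnectX-7 avoids with device-initiated transfers.

```python
import threading
import queue

class SendDescriptor:
    """Work the 'GPU side' publishes once a send half has staged its data."""
    def __init__(self, dst_rank: int, nbytes: int):
        self.dst_rank = dst_rank
        self.nbytes = nbytes

work_queue: "queue.Queue[SendDescriptor | None]" = queue.Queue()

def post_network_write(desc: SendDescriptor) -> None:
    # Stub for the network post the proxy would issue on the GPU's behalf.
    print(f"proxy: posted {desc.nbytes} bytes to rank {desc.dst_rank}")

def cpu_proxy_loop() -> None:
    # Proxy thread: the analogue of polling a flag in host-visible memory
    # that a GPU kernel sets when its portion of the send is ready.
    while True:
        desc = work_queue.get()
        if desc is None:              # shutdown sentinel
            return
        post_network_write(desc)      # extra CPU hop that GPUDirect Async would skip

proxy = threading.Thread(target=cpu_proxy_loop, daemon=True)
proxy.start()

# "GPU side": the send half finishes staging buffers and signals the proxy
# instead of touching the NIC directly.
for rank in (1, 2, 3):
    work_queue.put(SendDescriptor(dst_rank=rank, nbytes=4096))

work_queue.put(None)
proxy.join()
```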