Explorations of RDMA in LLM Systems
- #RDMA
- #High-Performance Computing
- #LLM Systems
- The team built an RDMA communication library based on unordered reliable datagram (URD) semantics, portable across AWS EFA and NVIDIA ConnectX NICs.
- Applied the library to KV-cache transfer in disaggregated inference, model-parameter updates in RL post-training, and MoE communication.
- Identified pain points with collective communication, including static participant groups, blocking initialization, unnecessary ordering guarantees, and rigid tensor requirements.
- Highlighted challenges with RDMA, such as lack of portable libraries, vendor lock-in (NVIDIA ConnectX), and performance discrepancies across NICs.
- Developed a general RDMA library focusing on reliable, unordered delivery, supporting both two-sided SEND/RECV and one-sided WRITE_IMM operations.
- Optimized MoE kernel performance, achieving better decode speeds than DeepEP on ConnectX-7 and usable performance on EFA.
- Shared insights on SRD vs. RC protocols, emphasizing SRD's advantages in programming simplicity despite EFA's lower bandwidth.
- Open-sourced the library and published the findings, including an arXiv paper, a GitHub repository, and blog posts.
- Reflected on the team's rapid progress in RDMA and systems optimization, from initial struggles to significant contributions in less than a year.
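The reliable-but-unordered delivery model mentioned above can be illustrated with a small sketch (hypothetical names, not the library's actual API): under URD/SRD-style semantics a receiver cannot assume messages arrive in order, so completion of a multi-chunk transfer is detected by tracking which chunks have landed (e.g., via WRITE_IMM immediate values) rather than by waiting for a final in-order message.

```python
import random

class UnorderedReceiver:
    """Tracks completion of a transfer split into `total_chunks` writes.

    Hypothetical sketch: each RDMA WRITE_IMM lands directly at its own
    offset in the destination buffer, and the immediate value identifies
    the chunk. Arrival order is irrelevant; the transfer is complete
    once every chunk has arrived.
    """
    def __init__(self, total_chunks):
        self.total_chunks = total_chunks
        self.received = set()

    def on_write_imm(self, chunk_id, payload, buffer):
        # One-sided write: data goes to a chunk-specific slot, so
        # out-of-order arrival never corrupts the buffer.
        buffer[chunk_id] = payload
        self.received.add(chunk_id)

    def complete(self):
        return len(self.received) == self.total_chunks

# Simulate a 4-chunk transfer whose writes arrive in a random order.
chunks = {0: b"he", 1: b"ll", 2: b"o ", 3: b"!!"}
order = list(chunks)
random.shuffle(order)

buffer = {}
rx = UnorderedReceiver(total_chunks=len(chunks))
for cid in order:
    rx.on_write_imm(cid, chunks[cid], buffer)

assert rx.complete()
message = b"".join(buffer[i] for i in range(len(chunks)))
print(message)  # b'hello !!'
```

This is the property that makes SRD's simpler programming model workable: ordering guarantees are dropped, and correctness comes from per-chunk addressing plus completion counting.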