Explorations of RDMA in LLM Systems
- #RDMA
- #High-Performance Computing
- #LLM Systems
- The team built an RDMA communication library based on unordered reliable datagram (URD) semantics, portable across AWS EFA and NVIDIA ConnectX NICs.
- Applied the library to KV-cache transfer in disaggregated inference, model-parameter updates in RL post-training, and MoE communication.
- Identified pain points with collective communication, including static participant groups, blocking initialization, unnecessary ordering guarantees, and rigid tensor requirements.
- Highlighted challenges with RDMA, such as lack of portable libraries, vendor lock-in (NVIDIA ConnectX), and performance discrepancies across NICs.
- Developed a general RDMA library focusing on reliable, unordered delivery, supporting both two-sided SEND/RECV and one-sided WRITE_IMM operations.
- Optimized MoE kernel performance, achieving better decode speeds than DeepEP on ConnectX-7 and usable performance on EFA.
- Shared insights on SRD vs. RC protocols, emphasizing SRD's advantages in programming simplicity despite EFA's lower bandwidth.
- Open-sourced the library and published the findings, including an arXiv paper, a GitHub repository, and blog posts.
- Reflected on the team's rapid progress in RDMA and systems optimization, from initial struggles to significant contributions in less than a year.
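The reliable-but-unordered delivery model mentioned above can be illustrated with a small sketch (hypothetical names, not the library's actual API): under URD/SRD-style semantics a receiver cannot assume messages arrive in order, so completion of a multi-chunk transfer is detected by tracking which chunks have landed (e.g., via WRITE_IMM immediate values) rather than by waiting for a final in-order message.

```python
import random

class UnorderedReceiver:
    """Tracks completion of a transfer split into `total_chunks` writes.

    Hypothetical sketch: each RDMA WRITE_IMM lands directly at its own
    offset in the destination buffer, and the immediate value identifies
    the chunk. Arrival order is irrelevant; the transfer is complete
    once every chunk has arrived.
    """
    def __init__(self, total_chunks):
        self.total_chunks = total_chunks
        self.received = set()

    def on_write_imm(self, chunk_id, payload, buffer):
        # One-sided write: data goes to a chunk-specific slot, so
        # out-of-order arrival never corrupts the buffer.
        buffer[chunk_id] = payload
        self.received.add(chunk_id)

    def complete(self):
        return len(self.received) == self.total_chunks

# Simulate a 4-chunk transfer whose writes arrive in a random order.
chunks = {0: b"he", 1: b"ll", 2: b"o ", 3: b"!!"}
order = list(chunks)
random.shuffle(order)

buffer = {}
rx = UnorderedReceiver(total_chunks=len(chunks))
for cid in order:
    rx.on_write_imm(cid, chunks[cid], buffer)

assert rx.complete()
message = b"".join(buffer[i] for i in range(len(chunks)))
print(message)  # b'hello !!'
```

This is the property that makes SRD's simpler programming model workable: ordering guarantees are dropped, and correctness comes from per-chunk addressing plus completion counting.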