UCCL-EP: DeepEP-style expert parallelism on any NIC, no GPU-initiated comms
4 days ago
- #UCCL-EP
- #Expert Parallelism
- #GPU Communication
- UCCL-EP extends expert-parallel (EP) communication kernels such as DeepEP to arbitrary NIC-accelerator pairs by reimplementing the GPU-initiated communication primitives those kernels depend on.
- DeepEP relies on GPU-initiated communication via NVSHMEM's device API, requiring NVIDIA GPUs and NICs, which limits hardware compatibility.
- UCCL-EP uses a proxy thread on the CPU to mediate GPU commands: GPUs write commands into rings in host memory, and the CPU thread executes those commands on any NIC.
- The contract that DeepEP's kernels rely on consists of one-sided writes, ordered signals, and quiet operations; UCCL-EP preserves it through its shim functions nvshmemi_ibgda_put_nbi_warp, nvshmemi_ibgda_amo_nonfetch_add, and nvshmemi_ibgda_quiet.
- Performance remains competitive with DeepEP on NVIDIA hardware and significantly improves on non-NVIDIA systems like AWS EFA and AMD-Broadcom clusters.
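
The proxy pattern summarized above can be sketched on the host alone. The following is a minimal illustration, not UCCL-EP's actual code: the names Cmd, Op, CmdRing, ProxyEndpoint, and demo are hypothetical, a second CPU thread stands in for the GPU producer, memcpy stands in for a one-sided NIC write, and an atomic add stands in for the remote signal. It shows one way a single in-order consumer can provide the three parts of the contract (non-blocking put, ordered signal, quiet).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical command descriptor; the real ring layout is not given in the post.
enum class Op : uint8_t { Put, SignalAdd, Stop };

struct Cmd {
    Op op;
    const void* src;                // Put: source bytes
    void* dst;                      // Put: destination (stands in for remote memory)
    size_t len;                     // Put: byte count
    std::atomic<int64_t>* counter;  // SignalAdd: counter to bump
    int64_t value;                  // SignalAdd: amount to add
};

// Single-producer, single-consumer ring. In the design described above it would
// live in host memory that the GPU writes; here both ends are CPU threads.
class CmdRing {
public:
    explicit CmdRing(size_t capacity) : slots_(capacity) { assert(capacity > 0); }

    void push(const Cmd& c) {  // producer side
        uint64_t t = tail_.load(std::memory_order_relaxed);
        while (t - head_.load(std::memory_order_acquire) == slots_.size())
            std::this_thread::yield();  // ring is full, wait for the proxy
        slots_[t % slots_.size()] = c;
        tail_.store(t + 1, std::memory_order_release);
    }

    Cmd pop() {  // consumer side, blocks until a command is available
        uint64_t h = head_.load(std::memory_order_relaxed);
        while (tail_.load(std::memory_order_acquire) == h)
            std::this_thread::yield();
        Cmd c = slots_[h % slots_.size()];
        head_.store(h + 1, std::memory_order_release);
        return c;
    }

private:
    std::vector<Cmd> slots_;
    std::atomic<uint64_t> head_{0}, tail_{0};
};

class ProxyEndpoint {
public:
    explicit ProxyEndpoint(size_t ring_capacity = 64)
        : ring_(ring_capacity), worker_([this] { run(); }) {}

    ~ProxyEndpoint() {
        ring_.push(Cmd{Op::Stop, nullptr, nullptr, 0, nullptr, 0});
        worker_.join();
    }

    // Non-blocking one-sided write: returns once the command is queued.
    void put_nbi(void* dst, const void* src, size_t len) {
        ++posted_;
        ring_.push(Cmd{Op::Put, src, dst, len, nullptr, 0});
    }

    // Ordered signal: executed after every command queued before it.
    void signal_add(std::atomic<int64_t>* counter, int64_t value) {
        ++posted_;
        ring_.push(Cmd{Op::SignalAdd, nullptr, nullptr, 0, counter, value});
    }

    // Quiet: wait until every command posted so far has been executed.
    void quiet() const {
        while (completed_.load(std::memory_order_acquire) != posted_)
            std::this_thread::yield();
    }

private:
    void run() {  // proxy thread: drains the ring strictly in FIFO order
        for (;;) {
            Cmd c = ring_.pop();
            if (c.op == Op::Stop) return;
            if (c.op == Op::Put) std::memcpy(c.dst, c.src, c.len);
            else c.counter->fetch_add(c.value, std::memory_order_release);
            completed_.fetch_add(1, std::memory_order_release);
        }
    }

    CmdRing ring_;
    uint64_t posted_ = 0;                 // touched only by the producer
    std::atomic<uint64_t> completed_{0};  // written only by the proxy
    std::thread worker_;                  // declared last so it starts last
};

// Usage: write a payload, raise a flag, and check the payload once the flag is seen.
inline bool demo() {
    char src[] = "token";
    char dst[sizeof(src)] = {};
    std::atomic<int64_t> flag{0};
    ProxyEndpoint ep;
    ep.put_nbi(dst, src, sizeof(src));
    ep.signal_add(&flag, 1);
    while (flag.load(std::memory_order_acquire) == 0) std::this_thread::yield();
    bool ok = std::memcmp(dst, src, sizeof(src)) == 0;
    ep.quiet();
    return ok;
}
```

In this sketch the ordering guarantee comes entirely from the single consumer draining the ring in FIFO order, so a receiver that observes the flag also observes the payload written before it. A real transport that completes writes out of order would need additional sequencing, which this example does not attempt to model.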