UCCL-EP: DeepEP-style expert parallelism on any NIC, no GPU-initiated comms

4 days ago
  • #UCCL-EP
  • #Expert Parallelism
  • #GPU Communication
  • UCCL-EP extends expert-parallel (EP) communication kernels such as DeepEP to arbitrary NIC-accelerator pairs by reimplementing the GPU-initiated communication primitives those kernels depend on.
  • DeepEP relies on GPU-initiated communication via NVSHMEM's device API, requiring NVIDIA GPUs and NICs, which limits hardware compatibility.
  • UCCL-EP uses a proxy thread on the CPU to mediate GPU commands: the GPU writes commands into ring buffers in host memory, and the CPU proxy reads them and issues the corresponding operations on any NIC.
  • The contract that DeepEP's kernels rely on consists of one-sided writes, ordered signals, and quiet operations. UCCL-EP preserves it through its shim functions nvshmemi_ibgda_put_nbi_warp, nvshmemi_ibgda_amo_nonfetch_add, and nvshmemi_ibgda_quiet.
  • Performance remains competitive with DeepEP on NVIDIA hardware and improves significantly on systems outside the all-NVIDIA stack, such as AWS EFA and AMD-Broadcom clusters.
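
The proxy mechanism summarized above can be illustrated with a small host-only model. This is a minimal sketch, not UCCL-EP's implementation: a Python thread stands in for the GPU kernel, a bounded ring stands in for the host-memory command ring, and plain local writes stand in for NIC operations. The names `CommandRing`, `proxy_loop`, `put_nbi`, `amo_nonfetch_add`, `quiet`, and `demo` are hypothetical and only loosely mirror the shim functions named in the summary. The assumption is that the proxy drains the ring in FIFO order, which is what makes a signal visible only after the writes issued before it.

```python
import threading


class CommandRing:
    """Bounded single-producer/single-consumer ring (stand-in for the host-memory ring)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0        # commands written by the producer ("GPU")
        self.tail = 0        # commands taken by the consumer (CPU proxy)
        self.completed = 0   # commands the proxy has finished executing
        self.cond = threading.Condition()

    def push(self, cmd):
        with self.cond:
            while self.head - self.tail == self.capacity:  # ring full: producer waits
                self.cond.wait()
            self.slots[self.head % self.capacity] = cmd
            self.head += 1
            self.cond.notify_all()

    def pop(self):
        with self.cond:
            while self.head == self.tail:  # ring empty: proxy waits
                self.cond.wait()
            cmd = self.slots[self.tail % self.capacity]
            self.tail += 1
            self.cond.notify_all()
            return cmd

    def complete(self):
        with self.cond:
            self.completed += 1
            self.cond.notify_all()

    def wait_all_complete(self):
        with self.cond:
            while self.completed != self.head:
                self.cond.wait()


def proxy_loop(ring, remote_buf, remote_signals):
    """CPU proxy: executes commands in FIFO order (a real proxy would post NIC operations)."""
    while True:
        cmd = ring.pop()
        if cmd[0] == "put":            # one-sided write
            _, offset, payload = cmd
            remote_buf[offset:offset + len(payload)] = payload
        elif cmd[0] == "atomic_add":   # signal, ordered after earlier puts
            _, index, value = cmd
            remote_signals[index] += value
        ring.complete()
        if cmd[0] == "stop":
            return


# Hypothetical "device-side" helpers: they only enqueue commands.
def put_nbi(ring, offset, payload):
    ring.push(("put", offset, bytes(payload)))


def amo_nonfetch_add(ring, index, value):
    ring.push(("atomic_add", index, value))


def quiet(ring):
    ring.wait_all_complete()  # returns once every issued command has executed


def demo():
    ring = CommandRing(capacity=4)
    remote_buf = bytearray(16)
    remote_signals = [0]
    proxy = threading.Thread(target=proxy_loop, args=(ring, remote_buf, remote_signals))
    proxy.start()
    for i in range(8):                 # more commands than slots, so the ring wraps
        put_nbi(ring, 2 * i, [i, i])
    amo_nonfetch_add(ring, 0, 8)       # signal that 8 chunks were written
    quiet(ring)
    result = (bytes(remote_buf), remote_signals[0])
    ring.push(("stop",))
    proxy.join()
    return result
```

Calling `demo()` returns the contents of the simulated remote buffer and the signal counter after `quiet` has returned. The point of the design is that the producer never touches the "NIC" itself: it only fills ring slots, so the transport behind the proxy can be swapped without changing the producer side.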