The Network Times: AI Cluster Networking
- #Ultra Ethernet
- #AI Cluster Networking
- #RDMA
- The Ultra Ethernet Specification v1.0 (UES) defines end-to-end communication for RDMA services in AI and HPC workloads over Ethernet.
- UES introduces Ultra Ethernet Transport (UET), a new RDMA-optimized transport-layer protocol, and revises the full communication stack to improve RDMA services.
- AI clusters consist of Scale-Out Backend Networks for inter-node GPU communication, Scale-Up Networks for intra-node GPU communication, Frontend Networks for user-facing inference traffic, Management Networks for orchestration, and Storage Networks for data access.
- Scale-Out Backend Networks require low-latency, lossless RDMA message transport and typically use Clos topologies.
- Scale-Up Networks use high-bandwidth, low-latency technologies like NVLink or AMD Infinity Fabric for intra-node GPU communication.
- Frontend Networks handle user inference requests, often using BGP EVPN and VXLAN for multitenancy.
- Management Networks are dedicated to orchestration and administration, ensuring secure and reliable connectivity.
- Storage Networks connect compute nodes to storage infrastructure, supporting high-performance data access with protocols like NVMe-oF.
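The scale-out backend's Clos (leaf-spine) topology mentioned above can be sized with simple port arithmetic. The sketch below is a hypothetical, non-blocking (1:1 oversubscription) two-tier example; the port and GPU counts are illustrative assumptions, not figures from the specification.

```python
# Hypothetical sizing sketch for a non-blocking two-tier Clos (leaf-spine)
# scale-out backend fabric. Switch port counts and GPU counts below are
# illustrative assumptions.

def clos_sizing(gpus: int, switch_ports: int) -> dict:
    """Size a non-blocking leaf-spine fabric where every GPU NIC gets a
    dedicated leaf port and uplink capacity equals downlink capacity."""
    down = switch_ports // 2            # leaf ports facing GPU NICs
    up = switch_ports - down            # leaf ports facing spines
    leaves = -(-gpus // down)           # ceiling division
    # Every leaf uplink terminates on a spine port.
    spines = -(-leaves * up // switch_ports)
    return {"leaves": leaves, "spines": spines,
            "gpu_ports": gpus, "uplinks": leaves * up}

# Example: 1024 GPUs on 64-port switches.
print(clos_sizing(1024, 64))
# {'leaves': 32, 'spines': 16, 'gpu_ports': 1024, 'uplinks': 1024}
```

Halving leaf ports between downlinks and uplinks is what keeps the fabric non-blocking; real deployments often accept some oversubscription to reduce spine count.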