The Network Times: AI Cluster Networking
- #Ultra Ethernet
- #AI Cluster Networking
- #RDMA
- The Ultra Ethernet Specification v1.0 (UES) defines end-to-end communication for RDMA services in AI and HPC workloads over Ethernet.
- UES introduces Ultra Ethernet Transport (UET), a new RDMA-optimized transport-layer protocol, and revises the full communication stack to improve RDMA services.
- AI clusters consist of Scale-Out Backend Networks for inter-node GPU communication, Scale-Up Networks for intra-node GPU communication, Frontend Networks for user-facing inference traffic, Management Networks for orchestration, and Storage Networks for data access.
- Scale-Out Backend Networks require low-latency, lossless RDMA message transport and typically use Clos topologies.
- Scale-Up Networks use high-bandwidth, low-latency technologies like NVLink or AMD Infinity Fabric for intra-node GPU communication.
- Frontend Networks handle user inference requests, often using BGP EVPN and VXLAN for multitenancy.
- Management Networks are dedicated to orchestration and administration, ensuring secure and reliable connectivity.
- Storage Networks connect compute nodes to storage infrastructure, supporting high-performance data access with protocols like NVMe-oF.
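The scale-out backend's Clos (leaf-spine) topology mentioned above can be sized with simple port arithmetic. The sketch below is a hypothetical, non-blocking (1:1 oversubscription) two-tier example; the port and GPU counts are illustrative assumptions, not figures from the specification.

```python
# Hypothetical sizing sketch for a non-blocking two-tier Clos (leaf-spine)
# scale-out backend fabric. Switch port counts and GPU counts below are
# illustrative assumptions.

def clos_sizing(gpus: int, switch_ports: int) -> dict:
    """Size a non-blocking leaf-spine fabric where every GPU NIC gets a
    dedicated leaf port and uplink capacity equals downlink capacity."""
    down = switch_ports // 2            # leaf ports facing GPU NICs
    up = switch_ports - down            # leaf ports facing spines
    leaves = -(-gpus // down)           # ceiling division
    # Every leaf uplink terminates on a spine port.
    spines = -(-leaves * up // switch_ports)
    return {"leaves": leaves, "spines": spines,
            "gpu_ports": gpus, "uplinks": leaves * up}

# Example: 1024 GPUs on 64-port switches.
print(clos_sizing(1024, 64))
# {'leaves': 32, 'spines': 16, 'gpu_ports': 1024, 'uplinks': 1024}
```

Halving leaf ports between downlinks and uplinks is what keeps the fabric non-blocking; real deployments often accept some oversubscription to reduce spine count.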