
The Network Times: AI Cluster Networking

  • #Ultra Ethernet
  • #AI Cluster Networking
  • #RDMA
  • The Ultra Ethernet Specification v1.0 (UES) defines end-to-end communication for RDMA services in AI and HPC workloads over Ethernet.
  • UES introduces Ultra Ethernet Transport (UET), a new RDMA-optimized transport-layer protocol, and adjusts the full application stack for improved RDMA services (a conceptual sketch of RDMA-style queue-pair semantics follows this list).
  • AI clusters consist of Scale-Out Backend Networks for inter-node GPU communication, Scale-Up Networks for intra-node GPU communication, Frontend Networks for user-facing inference traffic, Management Networks for orchestration, and Storage Networks for data access.
  • Scale-Out Backend Networks require low-latency, lossless RDMA message transport and typically use Clos topologies (a rough leaf-spine sizing sketch follows this list).
  • Scale-Up Networks use high-bandwidth, low-latency technologies such as NVLink or AMD Infinity Fabric for intra-node GPU communication (see the bandwidth comparison after this list).
  • Frontend Networks handle user-facing inference requests, often using BGP EVPN with VXLAN for multitenancy (a VXLAN header sketch follows this list).
  • Management Networks are dedicated to orchestration and administration, ensuring secure and reliable connectivity.
  • Storage Networks connect compute nodes to storage infrastructure, supporting high-performance data access with protocols like NVMe-oF.
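A minimal conceptual sketch of the queue-pair and completion-queue semantics that RDMA transports such as UET build on: the application posts work requests asynchronously and later polls for completions instead of blocking per transfer. The class and method names (QueuePair, WorkRequest, post_send, poll_cq) are illustrative only and are not the UET wire protocol or any real verbs API.

```python
from collections import deque
from dataclasses import dataclass

# Conceptual model of RDMA-style queue-pair semantics (illustrative only,
# not the UET protocol or a real verbs API).

@dataclass
class WorkRequest:
    op: str            # e.g. "WRITE" or "SEND"
    remote_addr: int   # target buffer address on the peer (for an RDMA WRITE)
    length: int        # payload length in bytes

class QueuePair:
    """One endpoint's send queue plus completion queue."""
    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr: WorkRequest) -> None:
        # The application posts work asynchronously; the NIC drains the queue.
        self.send_queue.append(wr)

    def poll_nic(self) -> None:
        # Stand-in for hardware: "transmit" each posted request, report completion.
        while self.send_queue:
            wr = self.send_queue.popleft()
            self.completion_queue.append(f"{wr.op} of {wr.length} B completed")

    def poll_cq(self) -> list:
        # The application polls completions rather than blocking on each transfer.
        done, self.completion_queue = list(self.completion_queue), deque()
        return done

qp = QueuePair()
qp.post_send(WorkRequest(op="WRITE", remote_addr=0x7F000000, length=1 << 20))
qp.poll_nic()
print(qp.poll_cq())
```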
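A rough sizing sketch for the Clos (leaf-spine) fabric mentioned for the scale-out backend. The 64-port switch radix and 1,024-GPU count are assumed values for illustration, not figures from the article; the arithmetic simply splits leaf ports evenly between downlinks and uplinks for a non-blocking two-tier design.

```python
# Illustrative sizing of a non-blocking two-tier (leaf-spine) Clos fabric.
# Port count and GPU count are assumptions for the example, not from the article.

switch_ports = 64          # ports per leaf/spine switch (assumed)
gpus = 1024                # GPUs to attach, one NIC port per GPU (assumed)

downlinks_per_leaf = switch_ports // 2   # half the ports face GPUs...
uplinks_per_leaf = switch_ports // 2     # ...half face spines (1:1, non-blocking)

leaves = -(-gpus // downlinks_per_leaf)                   # ceiling division
spines = -(-(leaves * uplinks_per_leaf) // switch_ports)  # ceiling division

print(f"{leaves} leaf switches, {spines} spine switches")
# -> 32 leaf switches, 16 spine switches
```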
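Back-of-the-envelope arithmetic for why the scale-up and scale-out domains are treated separately: per-GPU bandwidth inside the node is far higher than what a single backend NIC provides. The 900 GB/s and 400 Gb/s figures below are illustrative ballpark values (roughly NVLink-4-class and 400GbE-class), not numbers from the article.

```python
# Rough per-GPU bandwidth comparison: scale-up link vs. scale-out NIC.
# Both figures are assumed, illustrative ballpark values.

nvlink_gbps_per_gpu = 900 * 8   # ~900 GB/s aggregate scale-up bandwidth per GPU
nic_gbps_per_gpu = 400          # one 400 Gb/s Ethernet NIC per GPU

ratio = nvlink_gbps_per_gpu / nic_gbps_per_gpu
print(f"Scale-up bandwidth is ~{ratio:.0f}x the scale-out NIC bandwidth per GPU")
# -> ~18x with these assumed figures
```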
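To make the VXLAN multitenancy point concrete, here is a sketch that packs the 8-byte VXLAN header defined in RFC 7348, whose 24-bit VNI is what keeps tenants separate on the frontend network. The VNI value 10042 is an arbitrary example; the layout (I flag set in the first word, VNI in bits 8-31 of the second word) follows the RFC.

```python
import struct

# Build the 8-byte VXLAN header (RFC 7348) carrying the 24-bit VNI
# that separates tenants on the frontend network.

def vxlan_header(vni: int) -> bytes:
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08 << 24                         # I flag set: a valid VNI is present
    return struct.pack("!II", flags, vni << 8) # VNI sits in bits 8..31 of word 2

hdr = vxlan_header(vni=10042)   # example tenant VNI (assumed)
print(hdr.hex())                # -> 0800000000273a00
```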