AI Datacenters Were Built for GPUs. What Happens When You Remove the GPUs?
3 days ago
- #AI Networking
- #Datacenter Infrastructure
- #Distributed Training
- Traditional datacenter networks were built for north-south (client-to-server) traffic and could tolerate delays. AI training shifted the dominant pattern to east-west (server-to-server) traffic, making the network critical to accelerator utilization.
- AI clusters behave like distributed supercomputers of synchronized GPUs, where a delayed packet can stall thousands of accelerators. The metric that matters is therefore Job Completion Time (JCT), not average latency.
- Modern AI networks use RDMA via RoCEv2 (RDMA over Converged Ethernet v2) for low latency, but RoCEv2 is sensitive to packet loss. It relies on Priority Flow Control (PFC) to avoid drops, and PFC can in turn cause head-of-line blocking and congestion.
- NVIDIA's InfiniBand addresses these issues with a lossless, deterministic fabric, but it is costly and proprietary, and scaling clusters on it has led to rigid, rail-optimized topologies.
- Traditional routing such as ECMP (Equal-Cost Multi-Path) pins each flow to a single path, so it struggles with AI's elephant flows. This has prompted Dynamic Load Balancing and packet spraying in switches to distribute load more evenly and reduce congestion.
- The Ultra Ethernet Consortium (UEC) re-architects Ethernet for AI, using packet spraying and Virtual Output Queueing to challenge InfiniBand without losing Ethernet's ecosystem benefits.
- Almartis proposes an alternative: an associative memory architecture that reduces synchronization needs by focusing on memory locality and deterministic retrieval, which the company presents as enabling a GPU-free, 1-tier mesh datacenter.
- Future AI infrastructure may prioritize minimizing coordination latency in structured memory systems over maximizing synchronized throughput, potentially reducing the need for extensive GPU clusters.
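
To make the Job Completion Time point concrete, here is a minimal sketch of why synchronized training is governed by the slowest worker rather than the average one. It is a toy model, not a measurement: the function names, the 1 ms and 50 ms latencies, and the 0.1% slow-packet probability are all illustrative assumptions.

```python
import random


def straggler_probability(num_workers, tail_prob):
    """Chance that at least one of num_workers independent workers is slow in a step."""
    return 1.0 - (1.0 - tail_prob) ** num_workers


def simulate(num_workers, num_steps, base_ms=1.0, tail_ms=50.0,
             tail_prob=0.001, seed=0):
    """Return (mean per-worker latency, mean synchronized step time) in ms.

    Assumption: each worker independently sees either base_ms or, with
    probability tail_prob, tail_ms. A synchronized step ends only when
    the slowest worker has finished, so its duration is the maximum.
    """
    rng = random.Random(seed)
    total_mean = 0.0
    total_step = 0.0
    for _ in range(num_steps):
        latencies = [
            tail_ms if rng.random() < tail_prob else base_ms
            for _ in range(num_workers)
        ]
        total_mean += sum(latencies) / len(latencies)  # what "average latency" reports
        total_step += max(latencies)                   # what the job actually waits for
    return total_mean / num_steps, total_step / num_steps


if __name__ == "__main__":
    mean_ms, step_ms = simulate(num_workers=1000, num_steps=200)
    print(f"mean per-worker latency: {mean_ms:.2f} ms")
    print(f"mean synchronized step:  {step_ms:.2f} ms")
    print(f"steps with a straggler:  {straggler_probability(1000, 0.001):.1%}")
```

Under these assumptions, a 0.1% per-worker chance of a slow packet across 1,000 workers means roughly 63% of steps contain at least one straggler (1 - 0.999^1000 ≈ 0.632), even though the average per-worker latency barely moves.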
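
The ECMP and packet-spraying point can be illustrated the same way. The sketch below compares per-flow hashing, where every packet of a flow follows one link, with round-robin spraying of packets across all links. It is a simplified model under stated assumptions: the CRC32 hash stands in for whatever hash a real switch uses, the addresses and flow sizes are made up, and real spraying also has to deal with packet reordering, which is ignored here. The destination port 4791 is the UDP port RoCEv2 uses.

```python
import zlib


def ecmp_link(flow, num_links):
    """Pick one link per flow from a hash of its header fields (CRC32 as a stand-in)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_links


def ecmp_link_loads(flows, num_links):
    """Per-link packet counts when each flow is pinned to its hashed link."""
    loads = [0] * num_links
    for flow, packets in flows:
        loads[ecmp_link(flow, num_links)] += packets
    return loads


def sprayed_link_loads(flows, num_links):
    """Per-link packet counts when packets are sprayed round-robin over all links."""
    loads = [0] * num_links
    next_link = 0
    for _flow, packets in flows:
        for _ in range(packets):
            loads[next_link] += 1
            next_link = (next_link + 1) % num_links
    return loads


if __name__ == "__main__":
    # Eight hypothetical elephant flows of 1,000 packets each:
    # (source IP, destination IP, source port, destination port).
    flows = [
        ((f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791), 1000)
        for i in range(8)
    ]
    print("ECMP   :", ecmp_link_loads(flows, num_links=8))
    print("Sprayed:", sprayed_link_loads(flows, num_links=8))
```

With only a handful of large flows, hash collisions can leave some links carrying several flows while others sit idle. Spraying balances the load to within one packet per link by construction, which is the property the switch-level approaches in the summary aim for.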