Supercomputer networking to accelerate large scale AI training
7 hours ago
- #AI Training
- #Network Protocol
- #Supercomputing
- The MRC protocol, developed by OpenAI with industry partners, enhances GPU networking for large AI training clusters.
- It reduces network design complexity to support efficient compute at scale.
- MRC combines multi-plane networks, adaptive packet spraying, and static source routing to improve load balancing and redundancy.
- It minimizes congestion, routes around failures quickly, and simplifies network control.
- MRC is deployed across OpenAI's largest supercomputers and published as an OCP specification.
- It ensures predictable performance during training despite frequent link and switch failures.
- MRC supports synchronous pretraining by preventing job crashes and reducing idle GPU time.
- The protocol allows for simpler, lower-cost network topologies with fewer tiers.
- MRC was developed through cross-industry collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA.
- MRC is critical for scaling supercomputers to over 100,000 GPUs and advancing frontier model training.
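The mechanisms listed above can be illustrated with a toy sketch. This is a hypothetical simplification, not the actual MRC implementation: packets are sprayed round-robin across independent network planes (a stand-in for adaptive spraying), and planes marked as failed are simply skipped, so traffic routes around faults without recomputing routes centrally. The `PlaneSprayer` class and its method names are invented for illustration.

```python
import itertools

class PlaneSprayer:
    """Toy model of packet spraying across multiple network planes.

    Hypothetical sketch only: real adaptive spraying would weight plane
    choice by observed congestion, not plain round-robin.
    """

    def __init__(self, num_planes):
        self.planes = list(range(num_planes))
        self.healthy = set(self.planes)
        self._rr = itertools.cycle(self.planes)

    def mark_failed(self, plane):
        # A failed plane is excluded from spraying until it recovers.
        self.healthy.discard(plane)

    def mark_recovered(self, plane):
        self.healthy.add(plane)

    def next_plane(self):
        # Skip unhealthy planes; with N planes this loop takes at most N steps.
        if not self.healthy:
            raise RuntimeError("no healthy planes available")
        while True:
            plane = next(self._rr)
            if plane in self.healthy:
                return plane

    def spray(self, packets):
        # Assign each packet a plane; spreading load evens out congestion.
        return [(pkt, self.next_plane()) for pkt in packets]


sprayer = PlaneSprayer(num_planes=4)
sprayer.mark_failed(2)  # simulate a link/switch failure taking out plane 2
assignments = sprayer.spray(range(8))
planes_used = {plane for _, plane in assignments}
```

Because each plane is an independent copy of the fabric, a failure degrades capacity rather than connectivity, which is how frequent link and switch failures can be tolerated without crashing a synchronous training job.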