Decoupled DiLoCo: Resilient, Distributed AI Training at Scale
- #AI training
- #distributed computing
- #hardware resilience
- Decoupled DiLoCo is a new distributed architecture for training large AI models across distant data centers over lower-bandwidth links, with greater resilience to hardware failures.
- It divides training across decoupled compute islands with asynchronous data flow, isolating local disruptions so the remaining islands can continue learning.
- The architecture builds on Pathways and DiLoCo, enabling asynchronous training and self-healing capabilities that maintain training even after hardware failures.
- Testing with Gemma 4 models showed it maintains greater availability during failures while delivering ML performance on par with conventional synchronous training.
- It successfully trained a 12B-parameter model across four U.S. regions over 2-5 Gbps internet links, over 20 times faster than conventional synchronization methods.
- The system allows mixing different hardware generations (e.g., TPU v6e and v5p) in a single run, extending hardware life and increasing available compute without compromising performance.
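The island-based loop described above can be sketched as a toy simulation. This is not the Decoupled DiLoCo implementation: the names `Island`, `local_round`, and `train` are invented for illustration, the model is a single scalar, the inner optimizer is plain SGD in place of whatever the real system uses, and "failure" is modeled as an island simply skipping a round. The sketch shows the two properties the summary emphasizes: islands exchange only infrequent, compact pseudo-gradients rather than streaming every gradient, and a failed island does not stall the others.

```python
import random

def grad(w, data):
    # Gradient of mean squared error between scalar weight w and data points.
    return sum(2.0 * (w - x) for x in data) / len(data)

class Island:
    """One compute island (hypothetical name): runs H communication-free
    local steps on its own data shard, then reports a pseudo-gradient."""
    def __init__(self, data, inner_lr=0.1, inner_steps=20):
        self.data, self.lr, self.H = data, inner_lr, inner_steps

    def local_round(self, w_global):
        w = w_global
        for _ in range(self.H):          # cheap inner loop, no network traffic
            w -= self.lr * grad(w, self.data)
        return w_global - w              # pseudo-gradient: the only bytes sent

def train(islands, rounds=30, outer_lr=0.5, beta=0.5, fail_prob=0.3, seed=0):
    rng = random.Random(seed)
    w, momentum = 0.0, 0.0
    for _ in range(rounds):
        # Islands that "fail" this round are simply skipped; the survivors
        # keep learning, which models the resilience property above.
        deltas = [isl.local_round(w) for isl in islands
                  if rng.random() > fail_prob]
        if not deltas:
            continue                     # every island was down; retry next round
        avg = sum(deltas) / len(deltas)
        # Outer update with momentum applied to the averaged pseudo-gradient.
        momentum = beta * momentum + avg
        w -= outer_lr * (avg + beta * momentum)
    return w

# Three islands holding disjoint data shards; the overall data mean is 3.5.
islands = [Island([1.0, 2.0]), Island([3.0, 4.0]), Island([5.0, 6.0])]
w_final = train(islands)  # drifts toward the global mean despite dropped rounds
```

Because each island sends one small pseudo-gradient per round instead of per-step gradients, the inter-island traffic shrinks by roughly the number of inner steps, which is what makes ordinary 2-5 Gbps links viable in the setup described above.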