Decoupled DiLoCo: Resilient, Distributed AI Training at Scale
- #AI training
- #distributed computing
- #hardware resilience
- Decoupled DiLoCo is a new distributed architecture for training large AI models across distant data centers over lower-bandwidth links, with greater resilience to hardware failures.
- It divides training across decoupled compute islands with asynchronous data flow, isolating local disruptions so the remaining islands can continue learning.
- The architecture builds on Pathways and DiLoCo, enabling asynchronous training and self-healing capabilities that maintain training even after hardware failures.
- Testing with Gemma 4 models showed it maintains greater availability during failures while delivering ML performance on par with conventional synchronous training.
- It successfully trained a 12B-parameter model across four U.S. regions over 2-5 Gbps internet links, over 20 times faster than conventional synchronization methods.
- The system allows mixing different hardware generations (e.g., TPU v6e and v5p) in a single run, extending hardware life and increasing available compute without compromising performance.
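The island-based loop described above can be sketched as a toy simulation. This is not the Decoupled DiLoCo implementation: the names `Island`, `local_round`, and `train` are invented for illustration, the model is a single scalar, the inner optimizer is plain SGD in place of whatever the real system uses, and "failure" is modeled as an island simply skipping a round. The sketch shows the two properties the summary emphasizes: islands exchange only infrequent, compact pseudo-gradients rather than streaming every gradient, and a failed island does not stall the others.

```python
import random

def grad(w, data):
    # Gradient of mean squared error between scalar weight w and data points.
    return sum(2.0 * (w - x) for x in data) / len(data)

class Island:
    """One compute island (hypothetical name): runs H communication-free
    local steps on its own data shard, then reports a pseudo-gradient."""
    def __init__(self, data, inner_lr=0.1, inner_steps=20):
        self.data, self.lr, self.H = data, inner_lr, inner_steps

    def local_round(self, w_global):
        w = w_global
        for _ in range(self.H):          # cheap inner loop, no network traffic
            w -= self.lr * grad(w, self.data)
        return w_global - w              # pseudo-gradient: the only bytes sent

def train(islands, rounds=30, outer_lr=0.5, beta=0.5, fail_prob=0.3, seed=0):
    rng = random.Random(seed)
    w, momentum = 0.0, 0.0
    for _ in range(rounds):
        # Islands that "fail" this round are simply skipped; the survivors
        # keep learning, which models the resilience property above.
        deltas = [isl.local_round(w) for isl in islands
                  if rng.random() > fail_prob]
        if not deltas:
            continue                     # every island was down; retry next round
        avg = sum(deltas) / len(deltas)
        # Outer update with momentum applied to the averaged pseudo-gradient.
        momentum = beta * momentum + avg
        w -= outer_lr * (avg + beta * momentum)
    return w

# Three islands holding disjoint data shards; the overall data mean is 3.5.
islands = [Island([1.0, 2.0]), Island([3.0, 4.0]), Island([5.0, 6.0])]
w_final = train(islands)  # drifts toward the global mean despite dropped rounds
```

Because each island sends one small pseudo-gradient per round instead of per-step gradients, the inter-island traffic shrinks by roughly the number of inner steps, which is what makes ordinary 2-5 Gbps links viable in the setup described above.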