Hasty Briefs (beta)


Decoupled DiLoCo: Resilient, Distributed AI Training at Scale

4 hours ago
  • #AI training
  • #distributed computing
  • #hardware resilience
  • Decoupled DiLoCo is a new distributed architecture for training large AI models across geographically distant data centers, with lower bandwidth requirements and greater hardware resilience.
  • It divides training across decoupled compute islands with asynchronous data flow, isolating local disruptions and allowing other parts to continue learning.
  • The architecture builds on Pathways and DiLoCo, enabling asynchronous training and self-healing capabilities that keep a run progressing even after hardware failures.
  • Testing with Gemma 4 models showed it maintains higher availability during failures while matching the ML performance of traditional synchronous methods.
  • It successfully trained a 12B-parameter model across four U.S. regions over ordinary 2-5 Gbps Internet links, synchronizing more than 20 times faster than conventional methods.
  • The system allows mixing different hardware generations (e.g., TPU v6e and v5p) in a single run, extending hardware life and increasing available compute without compromising performance.
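The two-level loop these bullets describe (long runs of local steps on each compute island, followed by a low-bandwidth outer synchronization, with failed islands simply skipped) can be sketched in miniature. This is an illustrative toy under stated assumptions, not Google's implementation: the quadratic per-island losses, step counts, and learning rates are all invented, and the outer optimizer is simplified heavy-ball momentum (the DiLoCo paper uses Nesterov momentum).

```python
# Toy sketch of DiLoCo-style two-level training. All constants here are
# illustrative assumptions, not values from the article.
INNER_STEPS = 50   # local SGD steps per island between syncs
OUTER_ROUNDS = 30  # low-bandwidth synchronization rounds
INNER_LR = 0.1
OUTER_LR = 0.7
MOMENTUM = 0.6     # simplified heavy-ball outer momentum (assumption)

def inner_sgd(w, target, steps=INNER_STEPS, lr=INNER_LR):
    """One island's local training: minimize (w - target)**2 on its own data."""
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)
    return w

def train(targets, offline=lambda rnd, island: False):
    """Islands train locally, then the coordinator averages parameter deltas.

    `offline(rnd, island)` marks an island as failed for that round; the
    remaining islands keep training, mimicking the fault isolation the
    article describes.
    """
    w, velocity = 0.0, 0.0  # global (outer) parameter and momentum buffer
    for rnd in range(OUTER_ROUNDS):
        deltas = [w - inner_sgd(w, t)            # "outer gradient" per island
                  for i, t in enumerate(targets)
                  if not offline(rnd, i)]
        if not deltas:
            continue  # every island down: skip this round entirely
        outer_grad = sum(deltas) / len(deltas)   # only this crosses the WAN
        velocity = MOMENTUM * velocity + outer_grad
        w -= OUTER_LR * velocity
    return w

# Healthy run: three islands with per-island optima 1, 2, 3 agree on their mean.
w_ok = train([1.0, 2.0, 3.0])

# Island 0 goes offline for rounds 5-9; the others keep going and the run
# still converges once it returns.
w_fail = train([1.0, 2.0, 3.0], offline=lambda rnd, i: i == 0 and 5 <= rnd < 10)
```

The bandwidth saving comes from the structure of the loop: each island runs `INNER_STEPS` local updates between syncs, so only one small averaged delta crosses the slow inter-region link per round instead of a gradient exchange at every step.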