Training Foundation Models on a Full-Stack AMD Platform
- #Large-scale Pretraining
- #Mixture-of-Experts
- #AMD
- First large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware using MI300X GPUs with Pollara interconnect.
- Comprehensive cluster and networking characterization, including microbenchmarks for core collectives (all-reduce, reduce-scatter, all-gather, broadcast) on Pollara; a minimal benchmarking sketch follows this list.
- MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design.
- Introduction of MI300X-aware transformer sizing rules for attention and MLP blocks, choosing MoE widths that optimize training throughput and inference latency; a rough sizing-check sketch also follows this list.
- Detailed description of the training stack, including fault tolerance, checkpoint reshaping, and the training recipe.
- Preview of the model architecture and the ZAYA1 base model (760M active, 8.3B total parameters, MoE), with performance competitive with leading models such as Qwen3-4B and Gemma3-12B.
- Demonstration that the AMD hardware, network, and software stack is mature enough for competitive large-scale pretraining.
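
To make the collective microbenchmarks concrete, here is a minimal sketch of how per-message-size all-reduce bus bandwidth can be measured from PyTorch on a ROCm cluster, where `torch.distributed` dispatches to RCCL over the Pollara fabric. The message sizes, iteration counts, and bus-bandwidth formula are illustrative assumptions, not the exact methodology of the report.

```python
# Minimal all-reduce microbenchmark sketch, assuming a PyTorch/ROCm environment.
# Launch with, e.g.:  torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def benchmark_allreduce(num_elems: int, iters: int = 20, warmup: int = 5) -> float:
    """Time all-reduce on a bf16 tensor and return achieved bus bandwidth in GB/s."""
    world = dist.get_world_size()
    x = torch.randn(num_elems, dtype=torch.bfloat16, device="cuda")  # "cuda" maps to HIP on ROCm

    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # A ring all-reduce moves roughly 2 * (world - 1) / world bytes per element per GPU.
    bytes_moved = x.numel() * x.element_size() * 2 * (world - 1) / world
    return bytes_moved / elapsed / 1e9


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # resolves to RCCL on ROCm builds of PyTorch
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    for exp in range(19, 30):  # bf16 messages from 1 MiB to 1 GiB
        bw = benchmark_allreduce(2 ** exp)
        if dist.get_rank() == 0:
            print(f"all-reduce {2 ** exp * 2 / 2**20:8.0f} MiB -> {bw:7.1f} GB/s bus bandwidth")
    dist.destroy_process_group()
```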
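
The flavor of the MI300X-aware sizing rules can be illustrated with a rough, hypothetical check over candidate MLP widths: prefer shapes whose GEMMs tile evenly, fill the 304 compute units without a badly underfilled last wave, and stay compute-bound under a simple roofline estimate. The tile shape, roofline constants, and example model widths below are assumed placeholders, not the report's actual rules or ZAYA1's dimensions.

```python
# Hypothetical GEMM-shape sizing check in the spirit of "MI300X-aware sizing rules".
from dataclasses import dataclass

# Illustrative hardware constants (approximate; treat as assumptions).
MI300X_CUS = 304            # compute units per MI300X
TILE_M, TILE_N = 256, 256   # assumed output-tile shape of the bf16 GEMM kernel
PEAK_TFLOPS_BF16 = 1300     # approximate dense bf16 peak throughput
HBM_TB_PER_S = 5.3          # approximate HBM3 bandwidth


@dataclass
class GemmShape:
    m: int  # tokens in the microbatch
    n: int  # output width (e.g. MLP intermediate size)
    k: int  # input width (e.g. model hidden size)


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def tile_efficiency(g: GemmShape) -> float:
    """Fraction of launched tile area doing useful work (1.0 = no padding waste)."""
    launched = ceil_div(g.m, TILE_M) * TILE_M * ceil_div(g.n, TILE_N) * TILE_N
    return (g.m * g.n) / launched


def wave_efficiency(g: GemmShape) -> float:
    """How evenly the output tiles fill whole waves of compute units."""
    tiles = ceil_div(g.m, TILE_M) * ceil_div(g.n, TILE_N)
    return tiles / (ceil_div(tiles, MI300X_CUS) * MI300X_CUS)


def is_compute_bound(g: GemmShape, dtype_bytes: int = 2) -> bool:
    """Roofline check: arithmetic intensity above the machine balance point."""
    flops = 2 * g.m * g.n * g.k
    bytes_moved = dtype_bytes * (g.m * g.k + g.k * g.n + g.m * g.n)
    machine_balance = (PEAK_TFLOPS_BF16 * 1e12) / (HBM_TB_PER_S * 1e12)  # flops per byte
    return flops / bytes_moved > machine_balance


if __name__ == "__main__":
    hidden = 2048  # hypothetical model width, not ZAYA1's
    for d_ff in (5632, 6000, 6144, 8192):
        g = GemmShape(m=16384, n=d_ff, k=hidden)
        print(f"d_ff={d_ff}: tile_eff={tile_efficiency(g):.3f}, "
              f"wave_eff={wave_efficiency(g):.3f}, compute_bound={is_compute_bound(g)}")
```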