Training Foundation Models on a Full-Stack AMD Platform
- #Large-scale Pretraining
- #Mixture-of-Experts
- #AMD
- First large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware using MI300X GPUs with Pollara interconnect.
- Comprehensive cluster and networking characterization, including microbenchmarks for core collectives (all-reduce, reduce-scatter, all-gather, broadcast) on Pollara; a minimal benchmarking sketch follows this list.
- MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design.
- Introduction of MI300X-aware transformer sizing rules for attention and MLP blocks, choosing MoE widths that optimize training throughput and inference latency; a rough sizing-check sketch also follows this list.
- Detailed description of the training stack, including fault tolerance, checkpoint reshaping, and the training recipe.
- Preview of the model architecture and the ZAYA1 base model (760M active, 8.3B total parameters, MoE), with performance competitive with leading models such as Qwen3-4B and Gemma3-12B.
- Demonstration that the AMD hardware, network, and software stack is mature enough for competitive large-scale pretraining.
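
To make the collective microbenchmarks concrete, here is a minimal sketch of how per-message-size all-reduce bus bandwidth can be measured from PyTorch on a ROCm cluster, where `torch.distributed` dispatches to RCCL over the Pollara fabric. The message sizes, iteration counts, and bus-bandwidth formula are illustrative assumptions, not the exact methodology of the report.

```python
# Minimal all-reduce microbenchmark sketch, assuming a PyTorch/ROCm environment.
# Launch with, e.g.:  torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def benchmark_allreduce(num_elems: int, iters: int = 20, warmup: int = 5) -> float:
    """Time all-reduce on a bf16 tensor and return achieved bus bandwidth in GB/s."""
    world = dist.get_world_size()
    x = torch.randn(num_elems, dtype=torch.bfloat16, device="cuda")  # "cuda" maps to HIP on ROCm

    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # A ring all-reduce moves roughly 2 * (world - 1) / world bytes per element per GPU.
    bytes_moved = x.numel() * x.element_size() * 2 * (world - 1) / world
    return bytes_moved / elapsed / 1e9


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # resolves to RCCL on ROCm builds of PyTorch
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    for exp in range(19, 30):  # bf16 messages from 1 MiB to 1 GiB
        bw = benchmark_allreduce(2 ** exp)
        if dist.get_rank() == 0:
            print(f"all-reduce {2 ** exp * 2 / 2**20:8.0f} MiB -> {bw:7.1f} GB/s bus bandwidth")
    dist.destroy_process_group()
```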
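
The flavor of the MI300X-aware sizing rules can be illustrated with a rough, hypothetical check over candidate MLP widths: prefer shapes whose GEMMs tile evenly, fill the 304 compute units without a badly underfilled last wave, and stay compute-bound under a simple roofline estimate. The tile shape, roofline constants, and example model widths below are assumed placeholders, not the report's actual rules or ZAYA1's dimensions.

```python
# Hypothetical GEMM-shape sizing check in the spirit of "MI300X-aware sizing rules".
from dataclasses import dataclass

# Illustrative hardware constants (approximate; treat as assumptions).
MI300X_CUS = 304            # compute units per MI300X
TILE_M, TILE_N = 256, 256   # assumed output-tile shape of the bf16 GEMM kernel
PEAK_TFLOPS_BF16 = 1300     # approximate dense bf16 peak throughput
HBM_TB_PER_S = 5.3          # approximate HBM3 bandwidth


@dataclass
class GemmShape:
    m: int  # tokens in the microbatch
    n: int  # output width (e.g. MLP intermediate size)
    k: int  # input width (e.g. model hidden size)


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def tile_efficiency(g: GemmShape) -> float:
    """Fraction of launched tile area doing useful work (1.0 = no padding waste)."""
    launched = ceil_div(g.m, TILE_M) * TILE_M * ceil_div(g.n, TILE_N) * TILE_N
    return (g.m * g.n) / launched


def wave_efficiency(g: GemmShape) -> float:
    """How evenly the output tiles fill whole waves of compute units."""
    tiles = ceil_div(g.m, TILE_M) * ceil_div(g.n, TILE_N)
    return tiles / (ceil_div(tiles, MI300X_CUS) * MI300X_CUS)


def is_compute_bound(g: GemmShape, dtype_bytes: int = 2) -> bool:
    """Roofline check: arithmetic intensity above the machine balance point."""
    flops = 2 * g.m * g.n * g.k
    bytes_moved = dtype_bytes * (g.m * g.k + g.k * g.n + g.m * g.n)
    machine_balance = (PEAK_TFLOPS_BF16 * 1e12) / (HBM_TB_PER_S * 1e12)  # flops per byte
    return flops / bytes_moved > machine_balance


if __name__ == "__main__":
    hidden = 2048  # hypothetical model width, not ZAYA1's
    for d_ff in (5632, 6000, 6144, 8192):
        g = GemmShape(m=16384, n=d_ff, k=hidden)
        print(f"d_ff={d_ff}: tile_eff={tile_efficiency(g):.3f}, "
              f"wave_eff={wave_efficiency(g):.3f}, compute_bound={is_compute_bound(g)}")
```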