Hasty Briefs (beta)

Training Foundation Models on a Full-Stack AMD Platform

11 days ago
  • #Large-scale Pretraining
  • #Mixture-of-Experts
  • #AMD
  • First large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware using MI300X GPUs with Pollara interconnect.
  • Comprehensive cluster and networking characterization, including microbenchmarks for the core collectives (all-reduce, reduce-scatter, all-gather, broadcast) over Pollara; a minimal bandwidth-sweep sketch follows this list.
  • MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design; see the GEMM and copy-bandwidth sketch after this list.
  • Introduction of MI300X-aware transformer sizing rules for attention and MLP blocks, optimizing MoE widths for training throughput and inference latency.
  • Detailed description of the training stack, including fault tolerance, checkpoint reshaping, and the training recipe.
  • Preview of the model architecture and base model, ZAYA1 (8.3B total parameters, 760M active, MoE), with competitive performance against leading models such as Qwen3-4B and Gemma3-12B; an illustrative active-vs-total parameter count follows this list.
  • Demonstration that AMD's hardware, network, and software stack are mature enough for competitive large-scale pretraining.
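
To make the collective microbenchmarks concrete, here is a minimal sketch of an all-reduce bandwidth sweep using `torch.distributed`. It assumes a ROCm build of PyTorch, where the `"nccl"` backend dispatches to RCCL on MI300X (over Pollara in a multi-node setup); it is not the paper's actual benchmark harness, and the message sizes are illustrative.

```python
# Minimal sketch of a collective-bandwidth microbenchmark (assumption:
# ROCm PyTorch, where backend="nccl" maps to RCCL on MI300X).
# Launch with e.g.: torchrun --nproc_per_node=8 collective_bench.py
import time

import torch
import torch.distributed as dist


def bench_all_reduce(num_bytes: int, iters: int = 20, warmup: int = 5) -> float:
    """Return measured bus bandwidth (GB/s) for all-reduce at a given size."""
    rank = dist.get_rank()
    world = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    x = torch.ones(num_bytes // 2, dtype=torch.bfloat16, device=device)  # 2 bytes/elem

    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize(device)
    elapsed = (time.perf_counter() - start) / iters

    # Standard ring all-reduce bus-bandwidth convention: 2*(n-1)/n * size / time.
    bus_bytes = 2 * (world - 1) / world * num_bytes
    return bus_bytes / elapsed / 1e9


def main() -> None:
    dist.init_process_group(backend="nccl")  # RCCL under ROCm
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    for size in (1 << 20, 16 << 20, 256 << 20, 1 << 30):  # 1 MiB .. 1 GiB
        gbps = bench_all_reduce(size)
        if dist.get_rank() == 0:
            print(f"all-reduce {size / 2**20:8.0f} MiB: {gbps:6.1f} GB/s bus bw")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same loop can be pointed at `dist.reduce_scatter_tensor`, `dist.all_gather_into_tensor`, or `dist.broadcast` to cover the other collectives named above.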
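
For the kernel-sizing and memory-bandwidth microbenchmarks, a sketch of the single-GPU measurements that typically inform layer widths: sweep bf16 GEMM shapes and a device-to-device copy to estimate achieved TFLOP/s and bandwidth. The shapes below are placeholders, not the sweep used in the study.

```python
# Sketch of single-GPU sizing microbenchmarks: bf16 GEMM throughput and
# device-memory copy bandwidth. Shapes/sizes are illustrative only.
import time

import torch


def bench_gemm(m: int, n: int, k: int, iters: int = 50) -> float:
    """Return achieved TFLOP/s for a bf16 (m x k) @ (k x n) matmul."""
    a = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(k, n, dtype=torch.bfloat16, device="cuda")
    for _ in range(5):  # warm-up
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    return 2 * m * n * k / elapsed / 1e12


def bench_copy_bandwidth(num_gib: int = 4, iters: int = 20) -> float:
    """Return device-memory copy bandwidth in GB/s (read + write counted)."""
    n = num_gib * (1 << 30)
    src = torch.empty(n, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    return 2 * n / elapsed / 1e9


if __name__ == "__main__":
    print(f"copy bandwidth: {bench_copy_bandwidth():.0f} GB/s")
    # Sweep MLP-like GEMM widths to see which sizes the GPU runs efficiently;
    # results like these are what hardware-aware sizing rules are built from.
    for hidden in (2048, 3072, 4096, 5120, 6144, 8192):
        tflops = bench_gemm(m=8192, n=4 * hidden, k=hidden)
        print(f"hidden={hidden:5d}: {tflops:6.1f} TFLOP/s")
```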
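
Finally, an illustrative calculation of how "active" versus "total" parameters are counted for an MoE model. The configuration below is hypothetical, chosen only to land in roughly the same regime as the 8.3B-total / 760M-active figures quoted above; it is not ZAYA1's published architecture.

```python
# Illustrative active-vs-total parameter accounting for an MoE transformer.
# All widths, layer counts, and expert settings are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class MoEConfig:
    layers: int
    d_model: int
    d_ff: int          # per-expert MLP width
    n_experts: int     # experts per MoE layer
    top_k: int         # experts routed to per token
    vocab: int


def param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total, active) parameter counts, ignoring norms and router weights."""
    attn = 4 * cfg.d_model * cfg.d_model      # Q, K, V, O projections (plain MHA assumed)
    expert = 2 * cfg.d_model * cfg.d_ff       # up + down projection per expert
    embed = cfg.vocab * cfg.d_model           # tied input/output embedding assumed

    total = cfg.layers * (attn + cfg.n_experts * expert) + embed
    active = cfg.layers * (attn + cfg.top_k * expert) + embed  # only top-k experts fire
    return total, active


if __name__ == "__main__":
    # Hypothetical configuration, not the real ZAYA1 architecture.
    cfg = MoEConfig(layers=24, d_model=1536, d_ff=3584,
                    n_experts=32, top_k=2, vocab=50304)
    total, active = param_counts(cfg)
    print(f"total: {total/1e9:.2f}B  active: {active/1e9:.2f}B")
```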