GLM-5.2 on AMD MI355X: 2626 tok/s/node at over 2x lower cost than Blackwell

22 days ago

AMD MI355X GPUs cost about 2.75x less per GPU than NVIDIA Blackwell B300, making them a cost-effective option for AI inference.
Wafer reached 2626 tok/s/node aggregate throughput with GLM-5.2 on AMD MI355X at 2.4 RPS, which is 80% of B200 performance at over 2x lower cost, along with 213 tok/s in single-stream decoding.
Optimizations included quantizing GLM-5.2 to lossless MXFP4 with AMD Quark, using SGLang as the inference framework, fixing speculative decoding issues, and tuning MoE kernels to improve prefill performance.
AMD faces challenges with day-0 support and software friction for frontier models, but the performance gap is narrowing as kernel and model optimizations improve, reducing reliance on custom kernels.
The study highlights that achieving state-of-the-art performance on AMD is increasingly about support and optimization rather than inherent software limitations, indicating erosion of NVIDIA's CUDA advantage.
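
The "over 2x lower cost" figure can be sanity-checked from the numbers quoted above. The following is only a back-of-the-envelope sketch, not the original study's method. It assumes the 2.75x price ratio and the 80% relative throughput refer to the same Blackwell baseline (the summary cites B300 for price and B200 for performance), that both nodes hold the same number of GPUs, and that the 2626 tok/s figure counts generated tokens. All variable and function names are illustrative.

```python
# Figures quoted in the summary; how they are combined here is an assumption.
PRICE_RATIO = 2.75          # Blackwell cost per GPU divided by MI355X cost per GPU
RELATIVE_THROUGHPUT = 0.80  # MI355X throughput as a fraction of Blackwell throughput
NODE_TOK_PER_S = 2626       # aggregate tokens per second per MI355X node
REQUESTS_PER_S = 2.4        # request rate at which that throughput was measured


def cost_per_token_advantage(price_ratio: float, relative_throughput: float) -> float:
    """Return how many times cheaper one token is on the cheaper, slower GPU.

    Cost per token is proportional to price / throughput, so the ratio of
    the two costs is price_ratio * relative_throughput.
    """
    return price_ratio * relative_throughput


advantage = cost_per_token_advantage(PRICE_RATIO, RELATIVE_THROUGHPUT)

# Average tokens produced per request, assuming the system is in steady state.
tokens_per_request = NODE_TOK_PER_S / REQUESTS_PER_S

print(f"cost-per-token advantage: {advantage:.2f}x")
print(f"average tokens per request: {tokens_per_request:.0f}")
```

Under these assumptions the advantage comes out at about 2.2x, which is consistent with the "over 2x" claim, and the implied average is roughly 1,100 tokens per request.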
