Bringing Up DeepSeek-V4-Flash on AMD MI300X

5 hours ago

The MI300X launched in December 2023 as AMD's competitor to the NVIDIA H100. It offers 192GB of HBM3, comparable FP8 compute, and lower cost, but faces software challenges.
As of early May 2026, DeepSeek-V4-Flash did not run on MI300X with vLLM, primarily because of an FP8 dialect incompatibility (fnuz versus the OCP standard).
The MI300X uses the fnuz FP8 dialect, whose exponent bias differs by one from the OCP format, so the same bit pattern decodes to half the value. This caused factor-of-two errors in vLLM's FP8 paths; the fixes involve platform-specific dtype handling.
DeepSeek-V4's sparse attention requires tuned kernels; AITER (AMD's kernel library) has uneven coverage for MI300X, so some paths fall back to Triton.
HIP graphs reduce Python overhead in the decode loop, but everything captured must run purely on the device; some Triton kernels needed adjustments to be capture-safe.
Other bugs surfaced along the way, including MoE routing issues and tensor boundary errors, each of which required a specific commit to correct.
Optimization work on the sparse MLA and MXFP4 paths raised throughput from 2485 to 2699 output tok/s per GPU (about 8.6%).
MI300X offers cost and availability advantages over NVIDIA hardware, and software gaps are expected to close with newer AMD chips and vLLM improvements.
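
The FP8 dialect mismatch noted above can be illustrated with a minimal pure-Python sketch. It decodes one E4M3 byte under both interpretations, assuming the standard layouts (OCP e4m3fn: exponent bias 7, NaN at S.1111.111; fnuz: exponent bias 8, NaN only at 0x80). The function name `decode_e4m3` and the scale values are hypothetical, and doubling the scale is shown as one plausible correction, not necessarily the fix applied in vLLM.

```python
import math

def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one FP8 E4M3 byte as OCP e4m3fn (fnuz=False) or e4m3fnuz (fnuz=True)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF   # 4 exponent bits
    man = byte & 0x7          # 3 mantissa bits
    if fnuz:
        if byte == 0x80:      # fnuz has no negative zero; this pattern is its only NaN
            return math.nan
        bias = 8
    else:
        if exp == 0xF and man == 0x7:  # OCP e4m3fn reserves S.1111.111 for NaN
            return math.nan
        bias = 7
    if exp == 0:              # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)

raw = 0x38                               # same bits, two readings
as_ocp = decode_e4m3(raw, fnuz=False)    # 1.0
as_fnuz = decode_e4m3(raw, fnuz=True)    # 0.5
print(as_ocp, as_fnuz)

# Hypothetical correction: keep the stored bits and double the dequantization
# scale when the hardware reads them as fnuz, so the real value is unchanged.
scale_ocp = 0.25                         # illustrative per-tensor scale
scale_fnuz = scale_ocp * 2.0
assert as_ocp * scale_ocp == as_fnuz * scale_fnuz
```

The factor of two holds for every finite bit pattern the two formats share, which is why a checkpoint quantized for the OCP format produces uniformly halved values if its bytes are read as fnuz without adjustment. The representable range also differs: the largest finite value is 448 in OCP e4m3fn and 240 in fnuz.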
