Bringing Up DeepSeek-V4-Flash on AMD MI300X
5 hours ago
- #AMD MI300X
- #AI Inference
- #DeepSeek-V4-Flash
- The MI300X launched in December 2023 as AMD's competitor to the NVIDIA H100. It offers 192 GB of HBM3, comparable FP8 compute, and lower cost, but it faces software challenges.
- As of early May 2026, DeepSeek-V4-Flash did not run on the MI300X with vLLM, primarily because of an FP8 dialect incompatibility (fnuz vs. the OCP standard).
- The MI300X uses the fnuz FP8 dialect rather than the OCP one, which causes factor-of-two errors in vLLM's FP8 paths; the fixes involve platform-specific dtype handling.
- DeepSeek-V4's sparse attention requires tuned kernels; AITER (AMD's kernel library) has uneven coverage for the MI300X, so some paths must fall back to Triton.
- HIP graphs help reduce Python overhead in decode loops but require pure device functions; some Triton kernels needed adjustments for capture safety.
- Several bugs surfaced along the way, including MoE routing issues and tensor boundary errors, each of which needed a specific commit to fix.
- Optimization work focused on the sparse MLA and MXFP4 paths and raised throughput from 2485 to 2699 output tok/s per GPU (about 8.6%).
- MI300X offers cost and availability advantages over NVIDIA hardware, and software gaps are expected to close with newer AMD chips and vLLM improvements.
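
The factor-of-two error in the FP8 bullets above is consistent with the different exponent biases of the two E4M3 encodings: OCP E4M3 (exposed in PyTorch as `torch.float8_e4m3fn`) uses a bias of 7, while the fnuz variant (`torch.float8_e4m3fnuz`) uses a bias of 8. The sketch below is a pure-Python illustration, not code from vLLM; the helper names `decode_e4m3` and `all_finite_patterns_differ_by_two` are made up for this example. It decodes a single byte under each dialect and shows that the same bit pattern yields half the value under fnuz.

```python
import math


def decode_e4m3(byte: int, *, fnuz: bool) -> float:
    """Decode one 8-bit E4M3 pattern (1 sign, 4 exponent, 3 mantissa bits).

    fnuz=False: OCP e4m3fn, exponent bias 7, NaN is S.1111.111.
    fnuz=True:  e4m3fnuz, exponent bias 8, the only NaN is 0x80.
    Hypothetical helper for illustration only.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if fnuz:
        if byte == 0x80:
            return math.nan  # fnuz has no negative zero; this pattern is NaN
        bias = 8
    else:
        if exp == 0xF and man == 0x7:
            return math.nan  # OCP reserves the all-ones pattern for NaN
        bias = 7
    if exp == 0:
        # Subnormal: no implicit leading one
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


def all_finite_patterns_differ_by_two() -> bool:
    """Check that every byte finite in both dialects decodes 2x larger as OCP."""
    for b in range(256):
        ocp = decode_e4m3(b, fnuz=False)
        fz = decode_e4m3(b, fnuz=True)
        if math.isnan(ocp) or math.isnan(fz):
            continue
        if ocp != 2.0 * fz:
            return False
    return True


print(decode_e4m3(0x38, fnuz=False), decode_e4m3(0x38, fnuz=True))  # 1.0 0.5
print(all_finite_patterns_differ_by_two())  # True
```

Under these encodings, bytes written for the OCP dialect and reinterpreted as fnuz come out at half their intended magnitude. One plausible compensation, assuming the weights carry a separate scale factor, is to double that scale on fnuz hardware and to remap the `0x80` pattern, which is negative zero in OCP but NaN in fnuz.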