Hasty Briefs (beta)

Bringing Up DeepSeek-V4-Flash on AMD MI300X

5 hours ago
  • #AMD MI300X
  • #AI Inference
  • #DeepSeek-V4-Flash
  • The MI300X launched in December 2023 as AMD's competitor to the NVIDIA H100. It offers 192 GB of HBM3, comparable FP8 compute, and lower cost, but it faces software challenges.
  • Running DeepSeek-V4-Flash on MI300X with vLLM was not functional as of early May 2026, primarily due to FP8 dialect incompatibility (fnuz vs OCP standard).
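  • The MI300X uses the fnuz FP8 dialect, whose E4M3 format has an exponent bias of 8 rather than the OCP format's 7, so the same bit pattern decodes to half the value. This causes factor-of-two errors in vLLM's FP8 paths; fixes involve platform-specific dtype handling.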
  • DeepSeek-V4's sparse attention requires tuned kernels; AITER (AMD's kernel library) has uneven coverage for MI300X, necessitating fallbacks to Triton.
  • HIP graphs reduce Python overhead in decode loops, but captured code must run entirely on the device, with no host-side synchronization; some Triton kernels needed adjustments to be capture-safe.
  • Various bugs were encountered, including MoE routing issues and tensor boundary errors, requiring specific commits for corrections.
  • Optimization efforts raised throughput from 2485 to 2699 output tok/s per GPU (a gain of about 8.6%), focusing on the sparse MLA and MXFP4 paths.
  • MI300X offers cost and availability advantages over NVIDIA hardware, and software gaps are expected to close with newer AMD chips and vLLM improvements.
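
The factor-of-two FP8 error mentioned above can be illustrated with a small decoder. This is a minimal sketch, not vLLM's code: `decode_e4m3` and `rescale_for_fnuz` are hypothetical helper names, and the scale-doubling remedy shown is an assumption about what "platform-specific dtype handling" might look like. The bit layouts themselves (1 sign bit, 4 exponent bits, 3 mantissa bits; bias 7 for OCP e4m3fn, bias 8 for e4m3fnuz) are standard.

```python
import math


def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one FP8 E4M3 byte as OCP e4m3fn (bias 7) or e4m3fnuz (bias 8)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if fnuz:
        # fnuz has no negative zero: the 0x80 pattern is its only NaN.
        if byte == 0x80:
            return math.nan
        bias = 8
    else:
        # OCP e4m3fn reserves exponent=1111, mantissa=111 for NaN.
        if exp == 0xF and man == 0x7:
            return math.nan
        bias = 7
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


def rescale_for_fnuz(scale: float) -> float:
    """Hypothetical remedy: double the dequantization scale so that bytes
    quantized as OCP e4m3fn dequantize correctly when read as fnuz."""
    return 2.0 * scale


if __name__ == "__main__":
    byte, scale = 0x3C, 0.25  # arbitrary example byte and per-tensor scale
    ocp = decode_e4m3(byte, fnuz=False)
    fnuz = decode_e4m3(byte, fnuz=True)
    print("OCP value :", ocp)
    print("fnuz value:", fnuz)
    print("dequantized (OCP, scale)      :", ocp * scale)
    print("dequantized (fnuz, 2 * scale) :", fnuz * rescale_for_fnuz(scale))
```

Two bit patterns do not follow the halving rule and would need separate handling in any real conversion: `0x80` is negative zero in OCP but NaN in fnuz, and `0x7F`/`0xFF` are NaN in OCP but finite values in fnuz.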