AMD's CDNA 4 Architecture Announcement – By Chester Lam

a year ago
  • #AMD-CDNA4
  • #machine-learning
  • #GPU-architecture
  • CDNA 4 is AMD’s latest compute-oriented GPU architecture, focusing on boosting matrix multiplication performance for machine learning workloads.
  • CDNA 4 maintains AMD’s lead in vector operations while improving low-precision matrix throughput, doubling per-CU matrix performance in some cases.
  • The architecture uses a chiplet design similar to CDNA 3, with eight XCDs (Accelerator Complex Dies) stacked atop four base dies and Infinity Fabric providing coherent memory access.
  • Compared to Nvidia’s B200, AMD’s MI355X (CDNA 4) has more compute units but slightly lower per-unit matrix performance, relying on its larger unit count and higher clock speeds to compensate.
  • CDNA 4 increases LDS (Local Data Share) capacity to 160 KB per CU and doubles read bandwidth, improving efficiency for data shared among the threads of a workgroup.
  • New LDS instructions, including read-with-transpose, help matrix multiplication by letting kernels fetch operands in the layout the matrix units expect rather than through inefficient memory access patterns.
  • MI355X upgrades to HBM3E memory, offering higher bandwidth (8 TB/s) and capacity (288 GB) compared to Nvidia’s B200 (7.7 TB/s, 180 GB).
  • AMD retains a significant advantage in vector throughput and high-precision compute, while Nvidia leads in low-precision matrix operations.
  • CDNA 4’s improvements are incremental, refining CDNA 3’s design rather than overhauling it, similar to AMD’s Zen 3 to Zen 4 transition.
  • AMD’s strategy mirrors Nvidia’s focus on refining successful architectures, with CDNA 4 building on the MI300X’s achievements in supercomputing.
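
To put the memory comparison in proportion, the following minimal Python sketch computes the relative advantage from the bandwidth and capacity figures quoted in the summary (8 TB/s and 288 GB for MI355X, 7.7 TB/s and 180 GB for B200). The dictionary layout and the `advantage` helper are illustrative names, not taken from the article, and the numbers are used as quoted without independent verification.

```python
# Figures as quoted in the summary above (TB/s and GB); not independently verified.
MI355X = {"bandwidth_tbps": 8.0, "capacity_gb": 288}
B200 = {"bandwidth_tbps": 7.7, "capacity_gb": 180}


def advantage(a: float, b: float) -> float:
    """Return how much larger a is than b, as a fraction (0.10 means 10%)."""
    return a / b - 1.0


bw_adv = advantage(MI355X["bandwidth_tbps"], B200["bandwidth_tbps"])
cap_adv = advantage(MI355X["capacity_gb"], B200["capacity_gb"])

print(f"bandwidth advantage: {bw_adv:.1%}")  # roughly 4%
print(f"capacity advantage: {cap_adv:.1%}")  # 60%
```

On these figures the bandwidth gap is small (about 4%), while the capacity gap is large (60%), so the capacity difference is the more significant of the two for workloads that need to fit large models on a single device.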