AMD's CDNA 4 Architecture Announcement – By Chester Lam

a year ago

CDNA 4 is AMD’s latest compute-oriented GPU architecture, focusing on boosting matrix multiplication performance for machine learning workloads.
CDNA 4 maintains AMD’s lead in vector operations while improving low-precision matrix throughput, doubling per-CU matrix performance in some cases.
The architecture keeps a chiplet design similar to CDNA 3's, with eight XCDs (Accelerator Complex Dies) stacked on four base dies and Infinity Fabric providing coherent memory access.
AMD’s MI355X (CDNA 4) has more compute units than Nvidia’s B200 but slightly lower per-unit performance, and relies on the larger unit count and higher clock speeds to compensate.
CDNA 4 increases the capacity of the LDS (Local Data Share), a software-managed scratchpad local to each workgroup, to 160 KB and doubles its read bandwidth, improving efficiency for data shared within a workgroup.
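New LDS instructions, including read-with-transpose, help matrix multiplication, which pairs rows of one matrix with columns of another and so creates an inefficient access pattern for at least one operand whether data is stored in row-major or column-major order.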
The MI355X upgrades to HBM3E memory, offering 8 TB/s of bandwidth and 288 GB of capacity versus 7.7 TB/s and 180 GB on Nvidia’s B200, or roughly 4% more bandwidth and 60% more capacity.
AMD retains a significant advantage in vector throughput and high-precision compute, while Nvidia leads in low-precision matrix operations.
CDNA 4’s improvements are incremental, refining CDNA 3’s design rather than overhauling it, similar to AMD’s Zen 3 to Zen 4 transition.
AMD’s strategy mirrors Nvidia’s focus on refining successful architectures, with CDNA 4 building on the MI300X’s achievements in supercomputing.
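
The read-with-transpose point above can be illustrated with a small sketch. This is a toy model, not AMD's ISA: the flat list standing in for the LDS, the 4x4 tile size, and the function names `lds_read_row`, `lds_read_transposed`, and `matmul_tile` are all hypothetical. It only shows why a transposed read is convenient when one operand of a matrix multiply has to be walked by column.

```python
# Toy model: a TILE x TILE matrix stored row-major in a flat list that
# stands in for the LDS. All names and sizes here are illustrative
# assumptions, not real CDNA 4 instructions.
TILE = 4


def lds_read_row(lds, row):
    """Plain read: TILE consecutive elements, i.e. one row of the tile."""
    start = row * TILE
    return lds[start:start + TILE]


def lds_read_transposed(lds, col):
    """Transposed read: one element from each row at a fixed column.

    In row-major storage these elements are TILE apart, which is the
    strided pattern a read-with-transpose is meant to hide.
    """
    return [lds[r * TILE + col] for r in range(TILE)]


def matmul_tile(a_lds, b_lds):
    """Multiply two tiles: rows of A against columns of B."""
    result = []
    for i in range(TILE):
        a_row = lds_read_row(a_lds, i)
        out_row = []
        for j in range(TILE):
            b_col = lds_read_transposed(b_lds, j)
            out_row.append(sum(x * y for x, y in zip(a_row, b_col)))
        result.append(out_row)
    return result


if __name__ == "__main__":
    b = list(range(TILE * TILE))
    identity = [1 if r == c else 0 for r in range(TILE) for c in range(TILE)]
    print(lds_read_transposed(b, 1))  # column 1 of b: [1, 5, 9, 13]
    print(matmul_tile(identity, b))   # identity times b returns b as rows
```

With the transposed read, the inner loop sees both operands as simple length-`TILE` vectors, so the row-by-column product looks like a row-by-row one.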
