Occupancy Math on the AMD MI355X: A From-First-Principles Guide

3 days ago

#AMD CDNA4
#GPU Occupancy
#Performance Tuning

Occupancy on the AMD MI355X (CDNA4) is the fraction of a SIMD's wavefront slots that are kept filled. It is set by the tightest of four resource limits: VGPRs, SGPRs, LDS, and workgroup/barrier slots.
VGPRs come from a single 512-entry-per-lane register file that regular and accumulator registers share, rather than from separate pools. They are allocated with a granularity of 8.
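Occupancy in waves per SIMD is computed as min(floor(512 / total VGPRs per lane), floor(~800 / SGPRs per wave), LDS limit, workgroup limit), clamped by the hardware caps of 8 waves per SIMD and 32 per CU. The LDS term, floor(160 KB / LDS per workgroup), counts workgroups per CU, so it has to be converted to waves per SIMD before it is compared with the other terms.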
LDS is 160 KB per CU, shared by every workgroup resident on that CU. This is a major increase from CDNA3's 64 KB, and in many kernels it shifts the bottleneck from LDS to VGPRs.
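Hand calculations can drift from the real figure because allocations round up to the granularity. With VGPRs allocated in blocks of 8, a kernel that uses 100 VGPRs is charged for 104, which lowers the VGPR limit from floor(512 / 100) = 5 waves to floor(512 / 104) = 4.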
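The formula and the rounding effect can be sketched in a few lines of Python. This is a minimal sketch rather than a vendor tool: the constants are the figures quoted in this summary, the function and parameter names are hypothetical, the SGPR budget is the approximate "~800", the count of 4 SIMDs per CU is inferred from the 8-per-SIMD and 32-per-CU caps, and the LDS conversion assumes a workgroup's waves spread evenly across those SIMDs. The workgroup limit is left as an optional input because the summary gives no number for it.

```python
# Constants are the figures quoted in this summary, not values read from hardware docs.
VGPR_FILE_PER_LANE = 512        # unified regular + accumulator VGPR file
VGPR_GRANULE = 8                # VGPR allocation granularity
SGPR_BUDGET = 800               # the summary says "~800", so this is approximate
LDS_BYTES_PER_CU = 160 * 1024
MAX_WAVES_PER_SIMD = 8
MAX_WAVES_PER_CU = 32
SIMDS_PER_CU = MAX_WAVES_PER_CU // MAX_WAVES_PER_SIMD  # 4, inferred from the two caps


def round_up(value, granule):
    """Round value up to the next multiple of granule."""
    return -(-value // granule) * granule


def waves_per_simd(vgprs, sgprs, lds_bytes_per_wg=0, waves_per_wg=1, wg_limit=None):
    """Return (occupancy ceiling in waves per SIMD, name of the limiting resource)."""
    if vgprs < 1 or sgprs < 1 or waves_per_wg < 1:
        raise ValueError("vgprs, sgprs and waves_per_wg must be at least 1")
    limits = {
        "vgpr": VGPR_FILE_PER_LANE // round_up(vgprs, VGPR_GRANULE),
        "sgpr": SGPR_BUDGET // sgprs,
        "hw_cap": MAX_WAVES_PER_SIMD,
    }
    if lds_bytes_per_wg > 0:
        # The LDS term counts workgroups per CU; convert it to waves per SIMD,
        # assuming the workgroup's waves are spread evenly over the SIMDs.
        wgs_per_cu = LDS_BYTES_PER_CU // lds_bytes_per_wg
        limits["lds"] = wgs_per_cu * waves_per_wg // SIMDS_PER_CU
    if wg_limit is not None:
        # Hypothetical input: workgroup/barrier limit already expressed in waves per SIMD.
        limits["workgroup"] = wg_limit
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


print(waves_per_simd(96, 64))                 # (5, 'vgpr')
print(waves_per_simd(100, 64))                # (4, 'vgpr'): 100 rounds up to 104
print(waves_per_simd(64, 64, 64 * 1024, 4))   # (2, 'lds')
```

The second call shows the drift: a naive floor(512 / 100) would report 5 waves, while the rounded allocation allows only 4.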
Occupancy is often not the primary optimization goal: high occupancy does not guarantee performance, and low occupancy does not rule it out. In the article's microbenchmark, matrix core utilization stays near 97% even at low occupancy.
Little's Law explains latency hiding: the parallelism that must be in flight equals latency × throughput. That parallelism can come from thread-level parallelism (TLP, meaning more resident waves, which is what occupancy measures) or from instruction-level parallelism (ILP, meaning independent instructions within a single wave).
ILP can be more effective than TLP; a kernel with high ILP (e.g., 8 independent MFMA chains) saturates the matrix core at low occupancy, while low ILP requires high occupancy for similar throughput.
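The trade-off between ILP and TLP follows directly from Little's Law and can be illustrated numerically. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the latency and issue-rate figures in the usage lines are made-up round numbers, not measured MI355X values. It treats each independent chain in each resident wave as one operation in flight.

```python
import math


def ops_in_flight_needed(latency_cycles, issue_per_cycle):
    """Little's Law: parallelism needed = latency x throughput."""
    return math.ceil(latency_cycles * issue_per_cycle)


def waves_needed(latency_cycles, issue_per_cycle, independent_chains_per_wave):
    """Waves per SIMD required when each wave contributes a fixed number of
    independent instruction chains (its ILP)."""
    needed = ops_in_flight_needed(latency_cycles, issue_per_cycle)
    return math.ceil(needed / independent_chains_per_wave)


# Illustrative numbers only: 8-cycle latency, 1 instruction issued per cycle.
print(waves_needed(8, 1, 8))  # 1: eight independent chains cover the latency alone
print(waves_needed(8, 1, 1))  # 8: a single dependent chain needs eight waves
```

Under these assumed numbers, a wave with eight independent chains hides the latency by itself, while a kernel with one dependent chain needs the full 8 waves per SIMD to reach the same throughput.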
Optimization should focus on keeping the matrix core fed, using registers for large tiles and LDS for pipeline depth, rather than maximizing occupancy.
Workflow: read the kernel's resource usage from the compiled binary, compute the occupancy ceiling, check matrix-engine utilization, and, if the kernel is latency-bound, prioritize ILP or tile size over wave count.

Hasty Briefs (beta)