Occupancy Math on the AMD MI355X: A From-First-Principles Guide
3 days ago
- #AMD CDNA4
- #GPU Occupancy
- #Performance Tuning
- Occupancy on AMD MI355X (CDNA4) is the fraction of wavefront slots kept filled on a SIMD, determined by the min of four resource limiters: VGPRs, SGPRs, LDS, and workgroup/barrier slots.
- VGPRs come from a single 512-entry-per-lane register file that regular and accumulator registers share (there is no separate accumulator pool), and they are allocated in granules of 8.
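- Occupancy in waves per SIMD is computed as min(floor(512 / total-VGPRs-per-lane), floor(~800 / SGPRs-per-wave), LDS limit, workgroup limit), clamped by the hardware caps (8 waves/SIMD, 32 waves/CU). The LDS term, floor(160 KB / LDS-per-workgroup), counts workgroups per CU, so it has to be converted to waves per SIMD before it enters the min.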
- LDS is 160 KB per CU, shared by every workgroup resident on that CU. This is a major increase from CDNA3's 64 KB, and it shifts the bottleneck from LDS to VGPRs in many kernels.
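- Hand calculations can disagree with the real occupancy because of allocation granularity: resources such as VGPRs are rounded up to whole blocks. With a granule of 8, a kernel that uses 100 VGPRs is charged for 104 and fits 4 waves per SIMD, not the 5 that floor(512 / 100) suggests.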
- Occupancy is often not the primary optimization goal: high occupancy does not guarantee performance, and low occupancy does not rule it out, as shown by a microbenchmark in which matrix core utilization stays near 97% even at low occupancy.
- Little's Law explains latency hiding: parallelism needed equals latency × throughput, achievable via thread-level parallelism (TLP, occupancy) or instruction-level parallelism (ILP, within a wave).
- ILP can be more effective than TLP; a kernel with high ILP (e.g., 8 independent MFMA chains) saturates the matrix core at low occupancy, while low ILP requires high occupancy for similar throughput.
- Optimization should focus on keeping the matrix core fed, using registers for large tiles and LDS for pipeline depth, rather than maximizing occupancy.
- Workflow: read the kernel's resource usage from the compiled binary, compute the occupancy ceiling, check matrix-engine utilization, and, if the kernel is latency-bound, prioritize ILP or tile size over wave count.
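
The occupancy arithmetic and the Little's Law trade-off above can be sketched in a few lines of Python. This is a rough calculator under stated assumptions, not a replacement for the vendor tools. The constants are the figures quoted in this summary (including the approximate 800 SGPRs). The function names, the `workgroup_limit` parameter, and the even spread of waves across SIMDs are assumptions made for illustration. The latency and throughput numbers in the last two calls are made up and are not measured MI355X values.

```python
import math

# All constants are the figures quoted in the summary above, not datasheet values.
VGPRS_PER_LANE = 512          # unified regular + accumulator register file
VGPR_GRANULE = 8              # allocation granularity
SGPRS_PER_SIMD = 800          # the summary says "~800"
LDS_BYTES_PER_CU = 160 * 1024
MAX_WAVES_PER_SIMD = 8
MAX_WAVES_PER_CU = 32
SIMDS_PER_CU = MAX_WAVES_PER_CU // MAX_WAVES_PER_SIMD  # 4, implied by the two caps


def round_up(value, granule):
    """Round value up to the next multiple of granule."""
    return -(-value // granule) * granule


def occupancy_ceiling(vgprs, sgprs, lds_bytes_per_wg, waves_per_wg,
                      workgroup_limit=None):
    """Return (waves_per_simd, limiter_name) for one kernel.

    workgroup_limit is a hypothetical waves-per-SIMD cap from workgroup or
    barrier slots; the summary names this limiter but gives no number.
    """
    limits = {
        "VGPR": VGPRS_PER_LANE // round_up(vgprs, VGPR_GRANULE),
        "SGPR": SGPRS_PER_SIMD // sgprs,
        "hardware cap": MAX_WAVES_PER_SIMD,
    }
    if lds_bytes_per_wg > 0:
        # The LDS term counts workgroups per CU; convert to waves per SIMD,
        # assuming waves spread evenly over the SIMDs.
        wgs_per_cu = LDS_BYTES_PER_CU // lds_bytes_per_wg
        limits["LDS"] = (wgs_per_cu * waves_per_wg) // SIMDS_PER_CU
    if workgroup_limit is not None:
        limits["workgroup"] = workgroup_limit
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


def waves_needed(latency_cycles, ops_per_cycle, chains_per_wave):
    """Little's Law: ops in flight = latency * throughput, divided by the
    independent chains each wave already provides (ILP)."""
    in_flight = latency_cycles * ops_per_cycle
    return math.ceil(in_flight / chains_per_wave)


if __name__ == "__main__":
    # 100 VGPRs are charged as 104, so 4 waves fit rather than 5.
    print(occupancy_ceiling(vgprs=100, sgprs=64, lds_bytes_per_wg=0, waves_per_wg=4))
    # 64 KB of LDS per workgroup makes LDS the limiter.
    print(occupancy_ceiling(vgprs=128, sgprs=100,
                            lds_bytes_per_wg=64 * 1024, waves_per_wg=4))
    # Illustrative latency/throughput numbers only.
    print(waves_needed(latency_cycles=16, ops_per_cycle=0.5, chains_per_wave=8))
    print(waves_needed(latency_cycles=16, ops_per_cycle=0.5, chains_per_wave=1))
```

Returning the name of the limiter alongside the wave count follows the workflow in the last bullet: knowing which resource sets the ceiling tells you whether trimming registers or LDS would change anything. The `waves_needed` helper shows why a wave with many independent chains needs far fewer resident waves to cover the same latency.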