Hasty Briefs

Occupancy Math on the AMD MI355X: A From-First-Principles Guide

3 days ago
  • #AMD CDNA4
  • #GPU Occupancy
  • #Performance Tuning
  • Occupancy on AMD MI355X (CDNA4) is the fraction of wavefront slots kept filled on a SIMD, determined by the min of four resource limiters: VGPRs, SGPRs, LDS, and workgroup/barrier slots.
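  • VGPRs come from a single 512-entry-per-lane register file shared by regular and accumulator registers rather than from separate pools, and they are allocated with a granularity of 8.
  • Occupancy in waves per SIMD is computed as min(floor(512 / total-VGPRs-per-lane), floor(~800 / SGPRs-per-wave), floor(160 KB / LDS-per-workgroup), workgroup limit), clamped by the hardware caps (8 waves/SIMD, 32/CU). Note that the LDS term counts workgroups per CU, so it must be converted to waves per SIMD before it is compared with the other terms.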
  • LDS is 160 KB per CU (shared), a major increase from CDNA3's 64 KB, shifting bottlenecks from LDS to VGPRs in many kernels.
  • Hand calculations can drift from the real figure because of granularity rounding: resources such as VGPRs are allocated in blocks, so a request is rounded up to the next block boundary and can cost a wave of occupancy.
  • Occupancy is often not the primary optimization goal. High occupancy doesn't guarantee performance, and low occupancy doesn't rule it out: in a microbenchmark, matrix core utilization stays near 97% even at low occupancy.
  • Little's Law explains latency hiding: the parallelism needed equals latency × throughput, and it can come from thread-level parallelism (TLP, i.e. occupancy) or from instruction-level parallelism (ILP, independent instructions within a wave).
  • ILP can be more effective than TLP; a kernel with high ILP (e.g., 8 independent MFMA chains) saturates the matrix core at low occupancy, while low ILP requires high occupancy for similar throughput.
  • Optimization should focus on keeping the matrix core fed, using registers for large tiles and LDS for pipeline depth, rather than maximizing occupancy.
  • Workflow: read resource usage from the compiled binary, compute the occupancy ceiling, check matrix-engine utilization, and, if the kernel is latency-bound, prioritize ILP or tile size over wave count.
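
The occupancy formula and the granularity caveat above can be sketched in a few lines of Python. This is a minimal illustration, not a vendor tool. The constants are the figures quoted in the summary (512 VGPRs per lane, granularity 8, roughly 800 SGPRs, 160 KB LDS per CU, 8 waves per SIMD, 32 per CU). The function name, its parameters, the figure of 4 SIMDs per CU (derived from 32 / 8), and the way the LDS term is converted from workgroups per CU to waves per SIMD are all assumptions made for this example.

```python
# Figures quoted in the summary; treat them as inputs, not as a hardware spec.
VGPRS_PER_LANE = 512
VGPR_GRANULE = 8
SGPR_BUDGET = 800                 # approximate ("~800" in the summary)
LDS_BYTES_PER_CU = 160 * 1024
MAX_WAVES_PER_SIMD = 8
MAX_WAVES_PER_CU = 32
# Assumption: SIMD count derived from the two hardware caps above.
SIMDS_PER_CU = MAX_WAVES_PER_CU // MAX_WAVES_PER_SIMD


def round_up(value, granule):
    """Round value up to the next multiple of granule."""
    return ((value + granule - 1) // granule) * granule


def occupancy_ceiling(total_vgprs, sgprs, lds_bytes_per_wg, waves_per_wg,
                      workgroup_limit=MAX_WAVES_PER_SIMD):
    """Return (waves_per_simd, limiter_name) for a hypothetical kernel.

    total_vgprs counts regular plus accumulator registers per lane,
    since both come out of the same file.
    """
    limits = {"hardware": MAX_WAVES_PER_SIMD}

    # VGPRs are allocated in blocks, so round up before dividing.
    allocated_vgprs = round_up(max(total_vgprs, 1), VGPR_GRANULE)
    limits["vgpr"] = VGPRS_PER_LANE // allocated_vgprs

    limits["sgpr"] = SGPR_BUDGET // max(sgprs, 1)

    if lds_bytes_per_wg > 0:
        # LDS is shared per CU: count resident workgroups, then spread
        # their waves across the SIMDs (assumed even distribution).
        workgroups_per_cu = LDS_BYTES_PER_CU // lds_bytes_per_wg
        limits["lds"] = (workgroups_per_cu * waves_per_wg) // SIMDS_PER_CU

    limits["workgroup"] = workgroup_limit

    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


if __name__ == "__main__":
    # 170 VGPRs: naive floor(512 / 170) is 3, but rounding to 176 gives 2.
    print(occupancy_ceiling(total_vgprs=170, sgprs=96,
                            lds_bytes_per_wg=0, waves_per_wg=4))
```

The 170-register case shows the drift described above: the unrounded division suggests 3 waves per SIMD, while the rounded allocation of 176 registers leaves room for only 2.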