AMD GPUs Go Brrr
8 days ago
- #AMD GPUs
- #AI Performance
- #HipKittens
- HipKittens is an opinionated collection of programming primitives designed to help developers unlock the performance of AMD GPUs for AI workflows.
- AMD MI355X GPUs have 256 compute units (CUs) and differ from NVIDIA GPUs in several ways, such as a larger register file and smaller matrix core instructions.
- Unlike NVIDIA GPUs, AMD GPUs lack asynchronous matrix multiplication instructions, register reallocation, and tensor memory acceleration.
- HipKittens introduces optimized memory access patterns, including explicit register scheduling and swizzle patterns tailored for AMD's unique layouts.
- Two scheduling patterns are highlighted for AMD GPUs: 8-wave ping-pong and 4-wave interleave, which trade off programmability against performance.
- AMD's chiplet architecture requires chiplet-aware scheduling so that work sharing the same data lands on the same chiplet and reuses its cache, addressing NUMA effects at the cache level.
- HipKittens achieves competitive performance on AMD CDNA3 and CDNA4, with kernels outperforming AMD baselines and competing with NVIDIA's Blackwell kernels.
- The project emphasizes the need for diverse, open hardware in AI to realize its full potential, advocating for broader accessibility of AMD GPUs.
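
To make the chiplet-aware scheduling point concrete, here is a minimal Python sketch of one common way to remap workgroup IDs. It is not HipKittens' actual implementation. It assumes the hardware hands workgroups to chiplets round-robin (workgroup `pid` goes to chiplet `pid % num_xcds`) and that there are 8 chiplets. Both are assumptions for illustration, and the function name `remap_block_id` is hypothetical. The remap gives each chiplet a contiguous range of logical tiles, so neighboring tiles that share inputs are more likely to hit the same chiplet-local cache.

```python
def remap_block_id(pid: int, num_blocks: int, num_xcds: int = 8) -> int:
    """Map a hardware dispatch index to a logical tile index.

    Assumption (not from the source): workgroup `pid` runs on chiplet
    `pid % num_xcds`. The result assigns each chiplet one contiguous
    chunk of logical tiles.
    """
    xcd = pid % num_xcds    # chiplet that receives this workgroup (assumed round-robin)
    slot = pid // num_xcds  # position of this workgroup within that chiplet's sequence

    # Split num_blocks into num_xcds contiguous chunks; the first `rem`
    # chunks hold one extra tile when the division is not exact.
    base, rem = divmod(num_blocks, num_xcds)
    chunk_start = xcd * base + min(xcd, rem)
    return chunk_start + slot


if __name__ == "__main__":
    # Group the logical tiles by the chiplet that would process them.
    num_blocks, num_xcds = 16, 8
    tiles_per_xcd = {x: [] for x in range(num_xcds)}
    for pid in range(num_blocks):
        tiles_per_xcd[pid % num_xcds].append(remap_block_id(pid, num_blocks, num_xcds))
    for xcd, tiles in tiles_per_xcd.items():
        print(f"chiplet {xcd}: logical tiles {tiles}")
```

Without the remap, chiplet 0 would process logical tiles 0 and 8 in this configuration. With it, chiplet 0 processes tiles 0 and 1. Whether that improves cache reuse on real hardware depends on how logical tiles map to the data, which this sketch does not model.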