AMD GPUs Go Brrr

8 days ago


HipKittens is an opinionated collection of programming primitives designed to help developers unlock the performance of AMD GPUs for AI workflows.
AMD MI355X GPUs feature 256 compute units (CUs) and differ from NVIDIA GPUs in several ways, such as a larger register file and smaller matrix core instructions.
Unlike NVIDIA GPUs, AMD GPUs lack asynchronous matrix multiplication instructions, register reallocation, and tensor memory acceleration.
HipKittens introduces optimized memory access patterns, including explicit register scheduling and swizzle patterns tailored for AMD's unique layouts.
Two scheduling patterns are highlighted for AMD GPUs: 8-wave ping-pong and 4-wave interleave, which trade off programmability against performance.
AMD's chiplet architecture calls for chiplet-aware scheduling to improve cache reuse across the compute units on each chiplet, which addresses NUMA-like effects at the cache level.
HipKittens achieves competitive performance on AMD CDNA3 and CDNA4, with kernels outperforming AMD baselines and competing with NVIDIA's Blackwell kernels.
The project emphasizes the need for diverse, open hardware in AI to realize its full potential, advocating for broader accessibility of AMD GPUs.
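
The chiplet-aware scheduling point can be made concrete with a small simulation. This is a minimal sketch, not HipKittens code: it assumes 8 chiplets and assumes the hardware hands workgroup i to chiplet i % 8 (round-robin), and the function names are hypothetical. Under those assumptions, remapping workgroup IDs to tile IDs gives each chiplet a contiguous run of output tiles, so neighboring tiles that read the same inputs can share that chiplet's cache.

```python
from collections import defaultdict

NUM_CHIPLETS = 8  # assumption for illustration, not taken from the summary


def hardware_chiplet(wg_id, num_chiplets=NUM_CHIPLETS):
    """Assumed dispatch rule: workgroup i is placed on chiplet i % num_chiplets."""
    return wg_id % num_chiplets


def chiplet_aware_tile(wg_id, num_wgs, num_chiplets=NUM_CHIPLETS):
    """Map a hardware workgroup ID to a logical tile ID so that each
    chiplet receives a contiguous block of tiles (hypothetical helper)."""
    assert num_wgs % num_chiplets == 0, "sketch assumes an even split"
    per_chiplet = num_wgs // num_chiplets
    return (wg_id % num_chiplets) * per_chiplet + wg_id // num_chiplets


def tiles_per_chiplet(num_wgs, remap):
    """Return {chiplet: [tile IDs it processes]} with or without the remap."""
    groups = defaultdict(list)
    for wg in range(num_wgs):
        tile = chiplet_aware_tile(wg, num_wgs) if remap else wg
        groups[hardware_chiplet(wg)].append(tile)
    return dict(groups)


if __name__ == "__main__":
    print("naive   :", tiles_per_chiplet(32, remap=False)[0])
    print("remapped:", tiles_per_chiplet(32, remap=True)[0])
```

Without the remap, chiplet 0 works on tiles that are 8 apart; with it, chiplet 0 works on the first four tiles in order. The remap is a permutation, so every tile is still computed exactly once.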
