Hasty Briefs (beta)

  • #AMD GPUs
  • #AI Performance
  • #HipKittens
  • HipKittens is an opinionated collection of programming primitives designed to help developers unlock the performance of AMD GPUs for AI workflows.
  • AMD MI355X GPUs have 256 compute units (CUs) and differ from NVIDIA GPUs in several respects, including a larger register file and smaller matrix core instructions.
  • Unlike NVIDIA GPUs, AMD GPUs lack asynchronous matrix multiplication instructions, register reallocation, and tensor memory acceleration.
  • HipKittens introduces optimized memory access patterns, including explicit register scheduling and swizzle patterns tailored for AMD's unique layouts.
  • Two scheduling patterns are highlighted for AMD GPUs: 8-wave ping-pong and 4-wave interleave, which trade off programmability and performance.
  • AMD's chiplet architecture requires chiplet-aware scheduling to optimize cache reuse across processors, addressing NUMA effects at the cache level.
  • HipKittens achieves competitive performance on AMD CDNA3 and CDNA4, with kernels outperforming AMD baselines and competing with NVIDIA's Blackwell kernels.
  • The project emphasizes the need for diverse, open hardware in AI to realize its full potential, advocating for broader accessibility of AMD GPUs.
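
The swizzle patterns mentioned in the summary can be illustrated with a toy model. The sketch below is a generic XOR swizzle, not HipKittens' actual layout: the bank count, tile size, and function names are all assumptions chosen to keep the example small. It shows why reading a column of an unswizzled tile hits a single memory bank, while the swizzled layout spreads the same read across all banks.

```python
NUM_BANKS = 8  # toy value for illustration, not a real AMD bank count
TILE = 8       # hypothetical 8x8 tile, one word per element


def bank_plain(row: int, col: int) -> int:
    """Bank hit by element (row, col) in a plain row-major tile."""
    return (row * TILE + col) % NUM_BANKS


def bank_swizzled(row: int, col: int) -> int:
    """Bank hit when the column index is XOR-ed with the row index."""
    swizzled_col = col ^ (row % TILE)
    return (row * TILE + swizzled_col) % NUM_BANKS


def column_banks(bank_fn, col: int) -> list[int]:
    """Banks touched when every row reads the same logical column."""
    return [bank_fn(row, col) for row in range(TILE)]


if __name__ == "__main__":
    print("plain   :", column_banks(bank_plain, 3))
    print("swizzled:", column_banks(bank_swizzled, 3))
```

In this model a column read touches one bank without the swizzle and eight distinct banks with it; real layouts depend on the instruction's register layout and the hardware's bank width.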
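
The chiplet-aware scheduling point can also be sketched in a few lines. The example assumes that the hardware hands out workgroups to chiplets in round-robin order and that there are 8 chiplets; neither detail is stated in the summary, and `remap_block_id` is a hypothetical helper rather than a HipKittens function. Under those assumptions, remapping the dispatch index gives each chiplet a contiguous run of output tiles, so neighbouring tiles can share that chiplet's cache.

```python
def remap_block_id(block_id: int, num_blocks: int, num_chiplets: int = 8) -> int:
    """Map a hardware dispatch index to a logical tile index.

    Assumes round-robin dispatch: block b runs on chiplet b % num_chiplets.
    After remapping, each chiplet works on a contiguous range of tiles.
    """
    if num_blocks % num_chiplets != 0:
        raise ValueError("this sketch assumes num_blocks divides evenly")
    blocks_per_chiplet = num_blocks // num_chiplets
    chiplet = block_id % num_chiplets   # chiplet the block lands on (assumed)
    slot = block_id // num_chiplets     # position in that chiplet's queue
    return chiplet * blocks_per_chiplet + slot


def tiles_per_chiplet(num_blocks: int, num_chiplets: int = 8) -> dict[int, list[int]]:
    """Group the remapped tile indices by the chiplet that executes them."""
    groups: dict[int, list[int]] = {c: [] for c in range(num_chiplets)}
    for block_id in range(num_blocks):
        tile = remap_block_id(block_id, num_blocks, num_chiplets)
        groups[block_id % num_chiplets].append(tile)
    return groups


if __name__ == "__main__":
    for chiplet, tiles in tiles_per_chiplet(16).items():
        print(chiplet, tiles)
```

The remap is a permutation of the tile indices, so every tile is still computed exactly once; only the assignment of tiles to chiplets changes.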