HipKittens: Fast and furious AMD kernels

9 days ago
  • #GPU Programming
  • #AI Kernels
  • #AMD
  • HipKittens introduces state-of-the-art AMD kernels and programming primitives to simplify AMD kernel development.
  • AMD GPUs offer competitive peak compute and memory bandwidth but lack mature AI software, limiting their use in AI workflows.
  • Existing AMD software (AITER, PyTorch, Triton, Mojo, TileLang, Composable Kernel) often underperforms or is brittle, failing to consistently achieve peak performance.
  • Hand-optimized assembly is currently required for the most performant AMD AI kernels, making it difficult to scale across diverse AI workloads.
  • HipKittens provides a minimal, opinionated collection of programming primitives embedded in C++, demonstrating that tile-based abstractions can generalize across architectures.
  • HipKittens kernels outperform existing AMD baselines, including hand-optimized assembly, on attention forward and backward passes, GEMM, rotary, and fused dropout-residual-layernorm operations.
  • The goal is to enable multi-silicon AI systems by making AMD GPUs more accessible and performant for AI workloads.
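
To make the "tile-based abstractions" idea above concrete, here is a minimal sketch of a GEMM written in terms of tile-level load, multiply-accumulate, and store steps. It is plain Python for illustration only and is not HipKittens code: the names `load_tile`, `mma_tile`, `store_tile`, `tiled_matmul`, and the `TILE` size are all hypothetical, and the real library expresses these ideas as C++ primitives running on the GPU.

```python
# Illustrative sketch only: plain Python, not HipKittens code or its API.
TILE = 2  # hypothetical tile edge; real kernels pick sizes to suit the hardware


def load_tile(mat, row, col, size=TILE):
    """Copy the size x size block whose top-left corner is (row, col)."""
    return [r[col:col + size] for r in mat[row:row + size]]


def mma_tile(acc, a, b):
    """Accumulate the product of tiles a and b into acc, in place."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))


def store_tile(mat, row, col, tile):
    """Write a tile back into mat with its top-left corner at (row, col)."""
    for i, tile_row in enumerate(tile):
        mat[row + i][col:col + len(tile_row)] = tile_row


def tiled_matmul(a, b, size=TILE):
    """Multiply a (m x k) by b (k x n); all dimensions must be multiples of size."""
    m, k, n = len(a), len(b), len(b[0])
    assert m % size == 0 and k % size == 0 and n % size == 0
    c = [[0] * n for _ in range(m)]
    for i in range(0, m, size):          # one output tile per (i, j) pair
        for j in range(0, n, size):
            acc = [[0] * size for _ in range(size)]
            for p in range(0, k, size):  # walk the shared dimension tile by tile
                mma_tile(acc, load_tile(a, i, p, size), load_tile(b, p, j, size))
            store_tile(c, i, j, acc)
    return c
```

The kernel body is expressed only through whole-tile operations, so the outer loop structure stays the same if the tile size or the implementation of each tile operation changes. That separation is the property the summary refers to when it says tile-based abstractions can generalize across architectures.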