HipKittens: Fast and furious AMD kernels

9 days ago


HipKittens introduces state-of-the-art AMD kernels and programming primitives to simplify AMD kernel development.
AMD GPUs offer competitive peak compute and memory bandwidth but lack mature AI software, limiting their use in AI workflows.
Existing AMD software (AITER, PyTorch, Triton, Mojo, TileLang, Composable Kernel) often underperforms or is brittle, failing to consistently achieve peak performance.
Hand-optimized assembly is currently required for the most performant AMD AI kernels, making it difficult to scale across diverse AI workloads.
HipKittens provides a minimal, opinionated collection of programming primitives embedded in C++, demonstrating that tile-based abstractions can generalize across GPU architectures.
HipKittens kernels outperform existing AMD baselines, including hand-optimized assembly, on the attention forward and backward passes, GEMM, rotary, and fused dropout-residual-layernorm operations.
The goal is to enable multi-silicon AI systems by making AMD GPUs more accessible and performant for AI workloads.
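
To make the idea of a tile-based abstraction concrete, here is a minimal CPU-only sketch in plain C++. It is not the HipKittens API: `Tile`, `load_tile`, `store_tile`, `mma`, `tiled_gemm`, and the tile size of 4 are all hypothetical names and values chosen for illustration. It only shows the general pattern of writing a GEMM as loads, multiply-accumulates, and stores on fixed-size tiles rather than on individual elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is an illustration, not the HipKittens API.
constexpr std::size_t TILE = 4;  // assumed tile edge length

// A fixed-size square tile, standing in for a register or shared-memory tile.
struct Tile {
    std::array<float, TILE * TILE> data{};  // zero-initialized
    float& at(std::size_t r, std::size_t c) { return data[r * TILE + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * TILE + c]; }
};

// Copy the tile at tile-coordinates (tr, tc) out of a row-major matrix with ld columns.
Tile load_tile(const std::vector<float>& m, std::size_t ld,
               std::size_t tr, std::size_t tc) {
    Tile t;
    for (std::size_t r = 0; r < TILE; ++r)
        for (std::size_t c = 0; c < TILE; ++c)
            t.at(r, c) = m[(tr * TILE + r) * ld + tc * TILE + c];
    return t;
}

// Write a tile back to tile-coordinates (tr, tc) of a row-major matrix.
void store_tile(std::vector<float>& m, std::size_t ld,
                std::size_t tr, std::size_t tc, const Tile& t) {
    for (std::size_t r = 0; r < TILE; ++r)
        for (std::size_t c = 0; c < TILE; ++c)
            m[(tr * TILE + r) * ld + tc * TILE + c] = t.at(r, c);
}

// Tile-level multiply-accumulate: acc += a * b.
void mma(Tile& acc, const Tile& a, const Tile& b) {
    for (std::size_t r = 0; r < TILE; ++r)
        for (std::size_t c = 0; c < TILE; ++c)
            for (std::size_t k = 0; k < TILE; ++k)
                acc.at(r, c) += a.at(r, k) * b.at(k, c);
}

// C = A * B for row-major A (M x K) and B (K x N).
// All dimensions must be multiples of TILE in this sketch.
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t M, std::size_t N, std::size_t K) {
    assert(M % TILE == 0 && N % TILE == 0 && K % TILE == 0);
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M / TILE; ++i) {
        for (std::size_t j = 0; j < N / TILE; ++j) {
            Tile acc;  // accumulator tile for output block (i, j)
            for (std::size_t k = 0; k < K / TILE; ++k)
                mma(acc, load_tile(A, K, i, k), load_tile(B, N, k, j));
            store_tile(C, N, i, j, acc);
        }
    }
    return C;
}
```

The outer loops never touch individual elements; they only schedule tile operations. On a GPU, the same structure would let the tile type and the `mma` primitive map to hardware-specific instructions while the kernel logic stays unchanged, which is the kind of portability the summary attributes to tile-based abstractions.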
