HipKittens: Fast and furious AMD kernels
9 days ago
- #GPU Programming
- #AI Kernels
- #AMD
- HipKittens introduces state-of-the-art AMD kernels and programming primitives to simplify AMD kernel development.
- AMD GPUs offer competitive peak compute and memory bandwidth but lack mature AI software, limiting their use in AI workflows.
- Existing AMD software (AITER, PyTorch, Triton, Mojo, TileLang, Composable Kernel) often underperforms or is brittle, failing to consistently achieve peak performance.
- The most performant AMD AI kernels currently require hand-optimized assembly, an approach that is difficult to scale across diverse AI workloads.
- HipKittens provides a minimal, opinionated collection of programming primitives embedded in C++, demonstrating that tile-based abstractions can generalize across architectures.
- HipKittens kernels outperform existing AMD baselines, including hand-optimized assembly, on GEMM, attention forward and backward passes, rotary, and fused dropout-residual-layernorm operations.
- The goal is to enable multi-silicon AI systems by making AMD GPUs more accessible and performant for AI workloads.
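
To make the idea of a tile-based abstraction concrete, here is a minimal host-side C++ sketch of a GEMM written in terms of tiles. It is not the HipKittens API: `Tile`, `load_tile`, `store_tile`, `mma`, and `tiled_gemm` are hypothetical names chosen for illustration, and the code runs on the CPU rather than on a GPU. It only shows the programming style, in which a kernel is expressed as loads, multiply-accumulates, and stores on fixed-size tiles instead of on individual elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size tile type (illustration only, not a HipKittens type).
template <std::size_t R, std::size_t C>
struct Tile {
    std::array<float, R * C> data{};  // zero-initialized, row-major
    float& at(std::size_t r, std::size_t c) { return data[r * C + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * C + c]; }
};

// Copy an R x C block starting at (row0, col0) out of a row-major matrix
// whose rows are `ld` elements long.
template <std::size_t R, std::size_t C>
Tile<R, C> load_tile(const std::vector<float>& m, std::size_t ld,
                     std::size_t row0, std::size_t col0) {
    Tile<R, C> t;
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            t.at(r, c) = m[(row0 + r) * ld + (col0 + c)];
    return t;
}

// Write a tile back into a row-major matrix at (row0, col0).
template <std::size_t R, std::size_t C>
void store_tile(std::vector<float>& m, std::size_t ld,
                std::size_t row0, std::size_t col0, const Tile<R, C>& t) {
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            m[(row0 + r) * ld + (col0 + c)] = t.at(r, c);
}

// Tile-level multiply-accumulate: acc += a * b.
template <std::size_t M, std::size_t K, std::size_t N>
void mma(Tile<M, N>& acc, const Tile<M, K>& a, const Tile<K, N>& b) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                acc.at(i, j) += a.at(i, k) * b.at(k, j);
}

// C = A * B for square row-major n x n matrices; n must be a multiple of T.
template <std::size_t T>
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B, std::size_t n) {
    assert(n % T == 0);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; i += T) {
        for (std::size_t j = 0; j < n; j += T) {
            Tile<T, T> acc;  // accumulator tile starts at zero
            for (std::size_t k = 0; k < n; k += T) {
                Tile<T, T> a = load_tile<T, T>(A, n, i, k);
                Tile<T, T> b = load_tile<T, T>(B, n, k, j);
                mma(acc, a, b);
            }
            store_tile(C, n, i, j, acc);
        }
    }
    return C;
}
```

Calling `tiled_gemm<2>(A, B, 4)` multiplies two 4 x 4 matrices using 2 x 2 tiles. The outer loops only ever touch whole tiles, so in principle the architecture-specific work can be confined to how `load_tile`, `mma`, and `store_tile` are implemented.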