ParallelKittens: Simple and Fast Multi-GPU AI Kernels
- #AI Efficiency
- #ThunderKittens
- #GPU Networking
- Efforts to make AI more efficient include reducing compute usage, increasing hardware awareness, and broadening multi-vendor support.
- Recent advances in GPU networking and data-movement hardware, such as fourth-generation NVSwitch and TMA, open new opportunities for AI efficiency.
- We extended ThunderKittens to support multi-GPU kernels, explored hardware-driven design principles, and built new kernels on top of them.
- Key observations cover the choice of transfer mechanism, scheduling strategy, design overheads, and tile-granularity communication (see the sketch after this list).
- ThunderKittens matches or surpasses state-of-the-art implementations across a range of parallelism strategies.
- Future plans include inter-node communication, repo cleanup, and new applications like load-balancing MoEs.
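
To make "tile-granularity communication" concrete, here is a minimal sketch in plain CUDA of pushing a single tile from one GPU to a peer over NVLink using CUDA peer-to-peer copies. The 64x64 bf16 tile shape, buffer names, and device indices are illustrative assumptions, and this is not the ParallelKittens API; error handling and the `cudaDeviceCanAccessPeer` check are omitted for brevity.

```cpp
// Minimal sketch (not the ParallelKittens API): push one 64x64 bf16 tile
// from GPU 0 to GPU 1 over NVLink via a CUDA peer-to-peer copy.
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <cstdio>

constexpr int TILE_ROWS = 64, TILE_COLS = 64;   // illustrative tile shape
constexpr size_t TILE_BYTES = TILE_ROWS * TILE_COLS * sizeof(__nv_bfloat16);

int main() {
    __nv_bfloat16 *tile_src = nullptr, *tile_dst = nullptr;

    // Allocate one tile on each GPU and enable direct peer access (NVLink path).
    cudaSetDevice(0);
    cudaMalloc(&tile_src, TILE_BYTES);
    cudaDeviceEnablePeerAccess(1, 0);           // GPU 0 may access GPU 1's memory
    cudaSetDevice(1);
    cudaMalloc(&tile_dst, TILE_BYTES);
    cudaDeviceEnablePeerAccess(0, 0);           // GPU 1 may access GPU 0's memory

    // Asynchronously copy the tile from GPU 0 to GPU 1 on a dedicated stream,
    // so compute on either device can overlap with the transfer.
    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyPeerAsync(tile_dst, /*dstDevice=*/1,
                        tile_src, /*srcDevice=*/0,
                        TILE_BYTES, stream);
    cudaStreamSynchronize(stream);

    printf("copied %zu-byte tile GPU0 -> GPU1\n", TILE_BYTES);

    cudaStreamDestroy(stream);
    cudaFree(tile_src);
    cudaSetDevice(1);
    cudaFree(tile_dst);
    return 0;
}
```

Issuing transfers at tile granularity on a separate stream is what lets communication overlap with compute on both devices, rather than waiting for a whole tensor to be ready before sending it.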