ParallelKittens: Simple and Fast Multi-GPU AI Kernels
- #AI Efficiency
- #ThunderKittens
- #GPU Networking
- Efforts to make AI more efficient include reducing compute usage, increasing hardware awareness, and broadening multi-vendor support.
- Recent advances in GPU networking and data-movement hardware, such as fourth-generation NVSwitch and TMA, open new opportunities for AI efficiency.
- We extended ThunderKittens to support multi-GPU kernels, explored hardware-driven design principles, and built new kernels on top of them.
- Key observations cover the choice of transfer mechanism, scheduling strategy, design overheads, and tile-granularity communication (see the sketch after this list).
- ThunderKittens matches or surpasses state-of-the-art implementations across a range of parallelism strategies.
- Future plans include inter-node communication, repo cleanup, and new applications like load-balancing MoEs.
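
To make "tile-granularity communication" concrete, here is a minimal sketch in plain CUDA of pushing a single tile from one GPU to a peer over NVLink using CUDA peer-to-peer copies. The 64x64 bf16 tile shape, buffer names, and device indices are illustrative assumptions, and this is not the ParallelKittens API; error handling and the `cudaDeviceCanAccessPeer` check are omitted for brevity.

```cpp
// Minimal sketch (not the ParallelKittens API): push one 64x64 bf16 tile
// from GPU 0 to GPU 1 over NVLink via a CUDA peer-to-peer copy.
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <cstdio>

constexpr int TILE_ROWS = 64, TILE_COLS = 64;   // illustrative tile shape
constexpr size_t TILE_BYTES = TILE_ROWS * TILE_COLS * sizeof(__nv_bfloat16);

int main() {
    __nv_bfloat16 *tile_src = nullptr, *tile_dst = nullptr;

    // Allocate one tile on each GPU and enable direct peer access (NVLink path).
    cudaSetDevice(0);
    cudaMalloc(&tile_src, TILE_BYTES);
    cudaDeviceEnablePeerAccess(1, 0);           // GPU 0 may access GPU 1's memory
    cudaSetDevice(1);
    cudaMalloc(&tile_dst, TILE_BYTES);
    cudaDeviceEnablePeerAccess(0, 0);           // GPU 1 may access GPU 0's memory

    // Asynchronously copy the tile from GPU 0 to GPU 1 on a dedicated stream,
    // so compute on either device can overlap with the transfer.
    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyPeerAsync(tile_dst, /*dstDevice=*/1,
                        tile_src, /*srcDevice=*/0,
                        TILE_BYTES, stream);
    cudaStreamSynchronize(stream);

    printf("copied %zu-byte tile GPU0 -> GPU1\n", TILE_BYTES);

    cudaStreamDestroy(stream);
    cudaFree(tile_src);
    cudaSetDevice(1);
    cudaFree(tile_dst);
    return 0;
}
```

Issuing transfers at tile granularity on a separate stream is what lets communication overlap with compute on both devices, rather than waiting for a whole tensor to be ready before sending it.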