We Bought the Whole GPU, So We're Damn Well Going to Use the Whole GPU
- #GPU Optimization
- #High-Throughput Inference
- #Megakernel
- Release of a throughput-optimized megakernel for tensor-parallel inference with Llama-70B on H100s.
- Aggressive overlapping of compute, memory, and communication operations within the megakernel to keep every GPU resource busy (a minimal stream-level sketch of the idea follows this list).
- An end-to-end throughput improvement of over 22% compared to SGLang on the ShareGPT benchmark.
- Introduction of a novel distributed transpose operation for efficient cross-GPU communication after the attention step (see the communication sketch after this list).
- Detailed exploration of overlapping techniques within SMs, across SMs, and across GPUs to maximize resource utilization.
- Comparison of the megakernel's performance against existing systems such as vLLM and SGLang, demonstrating higher throughput.
- Discussion on future directions for simplifying megakernel design and extending the approach to training scenarios.
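
To make the overlapping idea concrete, the sketch below shows the classic stream-level form of compute/copy overlap in plain CUDA: while one chunk's result drains to the host on a copy stream, the compute stream is already working on the next chunk. This is a minimal illustration of the principle, not code from the megakernel, which fuses the same overlap into a single persistent kernel; the `square` kernel, chunk sizes, and buffer names here are our own assumptions.

```cuda
// overlap_sketch.cu -- minimal illustration of compute/copy overlap.
// NOT the megakernel: it achieves the same effect inside one persistent
// kernel rather than across CUDA streams.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(float* data, int n) {  // stand-in for "compute"
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

int main() {
    const int n_chunks = 4;
    const int chunk = 1 << 20;  // elements per chunk
    float *d_buf, *h_buf;
    cudaMalloc(&d_buf, (size_t)n_chunks * chunk * sizeof(float));
    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost(&h_buf, (size_t)n_chunks * chunk * sizeof(float));
    cudaMemset(d_buf, 0, (size_t)n_chunks * chunk * sizeof(float));

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cudaEvent_t done[n_chunks];

    for (int c = 0; c < n_chunks; ++c) {
        float* d = d_buf + (size_t)c * chunk;
        // Compute chunk c on the compute stream.
        square<<<(chunk + 255) / 256, 256, 0, compute>>>(d, chunk);
        cudaEventCreate(&done[c]);
        cudaEventRecord(done[c], compute);
        // Once chunk c is computed, copy it out on the copy stream while
        // the compute stream moves on to chunk c + 1.
        cudaStreamWaitEvent(copy, done[c], 0);
        cudaMemcpyAsync(h_buf + (size_t)c * chunk, d, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, copy);
    }
    cudaStreamSynchronize(copy);
    printf("first element after round trip: %f\n", h_buf[0]);

    for (int c = 0; c < n_chunks; ++c) cudaEventDestroy(done[c]);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```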
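
The distributed transpose itself is fused into the megakernel, so the post does not expose it as a standalone call. As a rough approximation of the data movement it performs, the sketch below expresses the equivalent collective as an NCCL all-to-all built from grouped point-to-point sends and receives. The function name `distributed_transpose`, the buffer layout, and the shapes are assumptions for illustration, not the post's API.

```cuda
// transpose_sketch.cu -- hypothetical host-side sketch of the communication
// behind a distributed transpose, expressed as an NCCL all-to-all.
#include <nccl.h>
#include <cuda_runtime.h>

// Before: each of n_ranks GPUs holds [tokens, local_dim] (all tokens, a
// 1/n_ranks slice of the hidden dimension, e.g. its attention heads).
// After: each GPU holds [tokens / n_ranks, hidden_dim] (its slice of the
// tokens, the full hidden dimension), ready for the next layer's GEMMs.
void distributed_transpose(const float* send_buf, float* recv_buf,
                           size_t tokens, size_t local_dim, int n_ranks,
                           ncclComm_t comm, cudaStream_t stream) {
    // Assume send_buf is already arranged so that block r holds the rows
    // destined for rank r.
    size_t block = (tokens / n_ranks) * local_dim;  // elements per peer
    ncclGroupStart();
    for (int r = 0; r < n_ranks; ++r) {
        ncclSend(send_buf + (size_t)r * block, block, ncclFloat, r, comm, stream);
        ncclRecv(recv_buf + (size_t)r * block, block, ncclFloat, r, comm, stream);
    }
    ncclGroupEnd();
    // A local reshape (omitted) then concatenates the received blocks
    // along the hidden dimension.
}
```

One general reason such a transpose is attractive: an all-to-all of this form moves roughly half the bytes of a ring all-reduce over the same tensor, leaving more interconnect bandwidth to hide behind compute.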