We Bought the Whole GPU, So We're Damn Well Going to Use the Whole GPU
- #GPU Optimization
- #High-Throughput Inference
- #Megakernel
- Release of a throughput-optimized megakernel for tensor-parallel inference with Llama-70B on H100s.
- Aggressive overlapping of compute, memory, and communication operations within the megakernel to keep every GPU resource busy (a minimal stream-level sketch of the idea follows this list).
- An end-to-end throughput improvement of over 22% compared to SGLang on the ShareGPT benchmark.
- Introduction of a novel distributed transpose operation for efficient cross-GPU communication after the attention step (see the communication sketch after this list).
- Detailed exploration of overlapping techniques within SMs, across SMs, and across GPUs to maximize resource utilization.
- Comparison of the megakernel's performance against existing systems such as vLLM and SGLang, demonstrating higher throughput.
- Discussion on future directions for simplifying megakernel design and extending the approach to training scenarios.
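
To make the overlapping idea concrete, the sketch below shows the classic stream-level form of compute/copy overlap in plain CUDA: while one chunk's result drains to the host on a copy stream, the compute stream is already working on the next chunk. This is a minimal illustration of the principle, not code from the megakernel, which fuses the same overlap into a single persistent kernel; the `square` kernel, chunk sizes, and buffer names here are our own assumptions.

```cuda
// overlap_sketch.cu -- minimal illustration of compute/copy overlap.
// NOT the megakernel: it achieves the same effect inside one persistent
// kernel rather than across CUDA streams.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(float* data, int n) {  // stand-in for "compute"
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

int main() {
    const int n_chunks = 4;
    const int chunk = 1 << 20;  // elements per chunk
    float *d_buf, *h_buf;
    cudaMalloc(&d_buf, (size_t)n_chunks * chunk * sizeof(float));
    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost(&h_buf, (size_t)n_chunks * chunk * sizeof(float));
    cudaMemset(d_buf, 0, (size_t)n_chunks * chunk * sizeof(float));

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cudaEvent_t done[n_chunks];

    for (int c = 0; c < n_chunks; ++c) {
        float* d = d_buf + (size_t)c * chunk;
        // Compute chunk c on the compute stream.
        square<<<(chunk + 255) / 256, 256, 0, compute>>>(d, chunk);
        cudaEventCreate(&done[c]);
        cudaEventRecord(done[c], compute);
        // Once chunk c is computed, copy it out on the copy stream while
        // the compute stream moves on to chunk c + 1.
        cudaStreamWaitEvent(copy, done[c], 0);
        cudaMemcpyAsync(h_buf + (size_t)c * chunk, d, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, copy);
    }
    cudaStreamSynchronize(copy);
    printf("first element after round trip: %f\n", h_buf[0]);

    for (int c = 0; c < n_chunks; ++c) cudaEventDestroy(done[c]);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```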
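
The distributed transpose itself is fused into the megakernel, so the post does not expose it as a standalone call. As a rough approximation of the data movement it performs, the sketch below expresses the equivalent collective as an NCCL all-to-all built from grouped point-to-point sends and receives. The function name `distributed_transpose`, the buffer layout, and the shapes are assumptions for illustration, not the post's API.

```cuda
// transpose_sketch.cu -- hypothetical host-side sketch of the communication
// behind a distributed transpose, expressed as an NCCL all-to-all.
#include <nccl.h>
#include <cuda_runtime.h>

// Before: each of n_ranks GPUs holds [tokens, local_dim] (all tokens, a
// 1/n_ranks slice of the hidden dimension, e.g. its attention heads).
// After: each GPU holds [tokens / n_ranks, hidden_dim] (its slice of the
// tokens, the full hidden dimension), ready for the next layer's GEMMs.
void distributed_transpose(const float* send_buf, float* recv_buf,
                           size_t tokens, size_t local_dim, int n_ranks,
                           ncclComm_t comm, cudaStream_t stream) {
    // Assume send_buf is already arranged so that block r holds the rows
    // destined for rank r.
    size_t block = (tokens / n_ranks) * local_dim;  // elements per peer
    ncclGroupStart();
    for (int r = 0; r < n_ranks; ++r) {
        ncclSend(send_buf + (size_t)r * block, block, ncclFloat, r, comm, stream);
        ncclRecv(recv_buf + (size_t)r * block, block, ncclFloat, r, comm, stream);
    }
    ncclGroupEnd();
    // A local reshape (omitted) then concatenates the received blocks
    // along the hidden dimension.
}
```

One general reason such a transpose is attractive: an all-to-all of this form moves roughly half the bytes of a ring all-reduce over the same tensor, leaving more interconnect bandwidth to hide behind compute.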