Writing high-performance matrix multiplication kernels for Blackwell
8 hours ago
- #GPU-optimization
- #high-performance-computing
- #matrix-multiplication
- Guide to writing high-performance matrix multiplication kernels for Blackwell.
- Initial implementation is simple but slow; it is then optimized step by step until it matches and finally exceeds cuBLAS and CUTLASS.
- Warning about benchmark differences due to input data distribution.
- Performance metrics for different implementations compared to cuBLAS and CUTLASS.
- Basic kernel achieves 37.62% TensorCore utilization, 59.4% of cuBLAS.
- Warp specialization improves utilization to 45.47%, 71.7% of cuBLAS (a simplified producer/consumer sketch appears after this list).
- Tiled epilogue further increases utilization to 55.82%, 88.1% of cuBLAS.
- Collective (2CTA) MMA reaches 59.41% utilization, 93.7% of cuBLAS.
- Persistent kernel achieves 61.46% utilization, 97.0% of cuBLAS.
- Dedicated epilogue warpgroup reaches 63.38% utilization, matching cuBLAS.
- Grid tiling achieves 69.44% utilization, 109.6% of cuBLAS (the only step to exceed it).
- Final kernel implementation is less than 150 lines and reaches state-of-the-art performance.
- Detailed explanation of each optimization step including code snippets.
- Use of collective MMAs to double arithmetic intensity.
- Persistent kernels to amortize block initialization costs (see the scheduling-loop sketch after this list).
- Dedicated epilogue warpgroups to overlap compute and memory operations.
- Grid tiling to better utilize the L2 cache (see the swizzle sketch after this list).
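
The warp-specialization step splits each thread block into producer warps that stage tiles in shared memory and consumer warps that do the math. The guide's kernel drives this with Blackwell TMA loads and mbarrier hand-offs; the sketch below is only a simplified CUDA analogue of the idea (plain loads, `__syncthreads()` hand-off, hypothetical tile sizes `BM`/`BN`/`BK`, and a 288-thread launch), not the article's implementation.

```cuda
#include <cuda_runtime.h>

// Simplified warp-specialization sketch (not the article's kernel): warp 0 is
// the producer and stages the next BK-slice of A and B into shared memory
// while warps 1..8 (256 consumer threads) accumulate on the slice staged in
// the previous iteration. Launch with 288 threads per block on a
// (N / BN, M / BM) grid, and assume M, N, K are multiples of BM, BN, BK.
constexpr int BM = 64, BN = 64, BK = 16;   // hypothetical tile sizes

__global__ void ws_matmul(const float* A, const float* B, float* C,
                          int M, int N, int K) {
    __shared__ float sA[2][BM][BK];   // double-buffered A slice
    __shared__ float sB[2][BK][BN];   // double-buffered B slice

    const int row0 = blockIdx.y * BM, col0 = blockIdx.x * BN;
    const int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
    const int nCons = blockDim.x - 32;           // consumer thread count (256)
    const int cid   = threadIdx.x - 32;          // consumer thread index
    float acc[BM * BN / 256] = {};               // per-consumer accumulators

    auto produce = [&](int buf, int k0) {        // producer warp: fill a buffer
        for (int i = lane; i < BM * BK; i += 32)
            sA[buf][i / BK][i % BK] = A[(row0 + i / BK) * K + k0 + i % BK];
        for (int i = lane; i < BK * BN; i += 32)
            sB[buf][i / BN][i % BN] = B[(k0 + i / BN) * N + col0 + i % BN];
    };

    if (warp == 0) produce(0, 0);                // prologue: stage first slice
    __syncthreads();

    for (int k0 = 0; k0 < K; k0 += BK) {
        const int cur = (k0 / BK) & 1;
        if (warp == 0) {                         // producer: stage next slice
            if (k0 + BK < K) produce(cur ^ 1, k0 + BK);
        } else {                                 // consumers: use current slice
            for (int e = 0, idx = cid; idx < BM * BN; ++e, idx += nCons)
                for (int kk = 0; kk < BK; ++kk)
                    acc[e] += sA[cur][idx / BN][kk] * sB[cur][kk][idx % BN];
        }
        __syncthreads();                         // hand the buffers over
    }
    if (warp > 0)
        for (int e = 0, idx = cid; idx < BM * BN; ++e, idx += nCons)
            C[(row0 + idx / BN) * N + col0 + idx % BN] = acc[e];
}
```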
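
A persistent kernel launches about one block per SM and lets each block loop over many output tiles, so block setup is paid once rather than once per tile. Below is a minimal sketch of that scheduling loop under stated assumptions: `compute_tile` is a naive stand-in for the real TMA + MMA main loop and epilogue (kept trivial so the example is self-contained), and tiles are handed out with a global counter, which is one common scheme (static striding by `gridDim.x` also works).

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 128;   // hypothetical output-tile edge

// Stand-in for the real per-tile work (TMA loads + MMAs + epilogue).
// Here it is a naive loop so the sketch compiles and runs on its own.
__device__ void compute_tile(int tm, int tn, const float* A, const float* B,
                             float* C, int M, int N, int K) {
    for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x) {
        int r = tm * TILE + i / TILE, c = tn * TILE + i % TILE;
        if (r >= M || c >= N) continue;
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) acc += A[r * K + k] * B[k * N + c];
        C[r * N + c] = acc;
    }
}

// Persistent scheduling loop: per-block setup happens once, then the block
// keeps claiming output tiles from a zero-initialized global counter.
__global__ void persistent_matmul(const float* A, const float* B, float* C,
                                  int M, int N, int K, unsigned* next_tile) {
    // ... one-time per-block initialization (smem, barriers, TMA) lives here ...
    const int tiles_m = (M + TILE - 1) / TILE;
    const int tiles_n = (N + TILE - 1) / TILE;
    __shared__ unsigned t;

    for (;;) {
        if (threadIdx.x == 0) t = atomicAdd(next_tile, 1u);   // claim a tile
        __syncthreads();
        const unsigned tile = t;
        if (tile >= unsigned(tiles_m * tiles_n)) break;       // grid exhausted
        compute_tile(tile / tiles_n, tile % tiles_n, A, B, C, M, N, K);
        __syncthreads();             // everyone is done with this tile (and t)
    }
}
```

`next_tile` must point to a zero-initialized `unsigned` in device memory, and the grid size should be roughly the SM count (query it with `cudaDeviceGetAttribute` and `cudaDevAttrMultiProcessorCount`).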
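
Grid tiling reorders which output tiles execute concurrently so that co-resident blocks share rows of A and columns of B and hit in L2 more often. The remap below is one common scheme, not necessarily the article's exact one: tile indices walk the output-tile grid in column groups of a hypothetical width `GROUP`.

```cuda
// L2-friendly tile remap (sketch): consecutive linear indices sweep GROUP
// columns of tiles before moving down a row, so blocks running at the same
// time touch only ~GROUP distinct B-column panels and reuse the same A-row
// panels from L2. GROUP is a hypothetical tuning parameter, a few tiles wide.
__host__ __device__ inline void swizzle_tile(int idx, int tiles_m, int tiles_n,
                                             int GROUP, int* tile_m, int* tile_n) {
    const int per_group = GROUP * tiles_m;             // tiles in a full group
    const int group     = idx / per_group;             // which column group
    const int in_group  = idx % per_group;
    const int rem       = tiles_n - group * GROUP;     // columns remaining
    const int width     = rem < GROUP ? rem : GROUP;   // last group may be narrow
    *tile_m = in_group / width;
    *tile_n = group * GROUP + in_group % width;
}
```

A non-persistent kernel would call `swizzle_tile(blockIdx.x, ...)` on a 1-D grid of `tiles_m * tiles_n` blocks; a persistent scheduler would apply the same remap to each tile index it claims.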