Hasty Briefs

Writing high-performance matrix multiplication kernels for Blackwell

10 hours ago
  • #GPU-optimization
  • #high-performance-computing
  • #matrix-multiplication
  • Guide to writing high-performance matrix multiplication kernels for Blackwell.
  • Initial implementation is simple but slow; it is then progressively optimized until it matches and finally exceeds cuBLAS and CUTLASS.
  • Warning about benchmark differences due to input data distribution.
  • Performance metrics for different implementations compared to cuBLAS and CUTLASS.
  • Basic kernel achieves 37.62% TensorCore utilization, 59.4% of cuBLAS.
  • Warp specialization improves utilization to 45.47%, 71.7% of cuBLAS.
  • Tiled epilogue further increases utilization to 55.82%, 88.1% of cuBLAS.
  • Collective (2CTA) MMA reaches 59.41% utilization, 93.7% of cuBLAS.
  • Persistent kernel achieves 61.46% utilization, 97.0% of cuBLAS.
  • Dedicated epilogue warpgroup reaches 63.38% utilization, matching cuBLAS.
  • Grid tiling achieves 69.44% utilization, 109.6% of cuBLAS, exceeding it.
  • Final kernel implementation is less than 150 lines and reaches state-of-the-art performance.
  • Detailed explanation of each optimization step including code snippets.
  • Use of collective (2-CTA) MMAs to double arithmetic intensity (see the intensity sketch after this list).
  • Persistent kernels to amortize per-block initialization costs (sketched below).
  • A dedicated epilogue warpgroup to overlap compute and memory operations (sketched below).
  • Grid tiling to make better use of the L2 cache (sketched below).
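As a back-of-the-envelope sketch of the arithmetic-intensity argument (my formula and numbers, not the article's): for a B_M × B_N output tile accumulated over the full K dimension with b-byte inputs, the main loop's arithmetic intensity is

    \[
      \mathrm{AI} \;=\; \frac{\text{FLOPs}}{\text{bytes}}
      \;=\; \frac{2\,B_M B_N K}{b\,K\,(B_M + B_N)}
      \;=\; \frac{2\,B_M B_N}{b\,(B_M + B_N)} .
    \]

With bf16 inputs (b = 2) and a 128 × 128 tile this comes to 64 FLOPs per byte. Any scheme that halves the bytes each SM must pull in for the same FLOPs, such as a CTA pair sharing operand tiles instead of each CTA loading both operands in full, doubles that figure, which is the intuition behind the collective MMA.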
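A minimal sketch of the persistent-kernel idea (hypothetical names, simplified grid-stride scheduling; the article's kernel may distribute work differently): launch roughly one block per SM and have each block loop over output tiles, so per-block setup is paid once rather than once per tile.

    // Each block claims tiles in a grid-stride loop instead of exiting after one,
    // so shared-memory setup, barrier init, descriptor setup, etc. are amortized.
    __global__ void persistent_gemm_stub(int num_tiles_m, int num_tiles_n)
    {
        // ... one-time per-block initialization (smem buffers, barriers, TMA descriptors) ...
        int num_tiles = num_tiles_m * num_tiles_n;
        for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
            int tile_m = tile / num_tiles_n;   // row index of this output tile
            int tile_n = tile % num_tiles_n;   // column index of this output tile
            // ... main loop: stage A/B tiles for (tile_m, tile_n), issue MMAs ...
            // ... epilogue: write the finished C tile ...
            __syncthreads();                   // make smem safe to reuse for the next tile
        }
    }

    // Launch with about one block per SM so every block stays resident, e.g.:
    //   int num_sms; cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    //   persistent_gemm_stub<<<num_sms, 128>>>(M / 128, N / 128);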
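The dedicated epilogue warpgroup is a form of warp specialization: different warpgroups in the block take different roles so that loading upcoming tiles, issuing MMAs, and writing out finished results overlap instead of serializing. A structural sketch only (no real synchronization or tensor-core instructions; the article's kernel uses mbarriers, TMA, and tcgen05 MMAs for this):

    // Role split by warpgroup (4 warps = 128 threads per warpgroup; assumes a block
    // of at least 3 warpgroups). The producer keeps feeding shared memory, the MMA
    // warpgroup keeps the tensor cores busy, and the epilogue warpgroup drains
    // finished accumulators to global memory in parallel with the next tile's work.
    __global__ void warp_specialized_stub()
    {
        int warpgroup = threadIdx.x / 128;
        if (warpgroup == 0) {
            // producer: issue async loads of upcoming A/B tiles into shared memory,
            // signal a "stage full" barrier as each stage lands
        } else if (warpgroup == 1) {
            // MMA: wait for full stages, issue tensor-core MMAs, mark accumulators done
        } else {
            // epilogue: wait for finished accumulators and store the C tile to global
            // memory while the other warpgroups already work on the next output tile
        }
    }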
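Grid tiling remaps the linear tile index so that blocks scheduled around the same time cover a compact rectangle of output tiles and therefore hit the same A rows and B columns in L2. A common group-wise remapping in this spirit (hypothetical names; the article's exact scheme may differ):

    // Map a linear tile id to (tile_m, tile_n) so that up to group_m consecutive ids
    // share the same small set of A row-tiles, improving L2 reuse of A and B.
    // Assumes tile_id < num_tiles_m * num_tiles_n.
    __device__ void swizzled_tile_coords(int tile_id, int num_tiles_m, int num_tiles_n,
                                         int group_m, int* tile_m, int* tile_n)
    {
        int tiles_per_group = group_m * num_tiles_n;       // ids covering one row group
        int group           = tile_id / tiles_per_group;   // which row group we are in
        int first_m         = group * group_m;             // first row tile of the group
        int rows_in_group   = min(num_tiles_m - first_m, group_m);  // last group may be short
        int id_in_group     = tile_id % tiles_per_group;
        *tile_m = first_m + (id_in_group % rows_in_group);  // walk down the group's rows
        *tile_n = id_in_group / rows_in_group;              // then step across the columns
    }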