Hasty Briefs

Writing high-performance matrix multiplication kernels for Blackwell

10 hours ago
  • #GPU-optimization
  • #high-performance-computing
  • #matrix-multiplication
  • Guide to writing high-performance matrix multiplication kernels for Blackwell.
  • Initial implementation is simple but slow; it is then progressively optimized until it matches and finally exceeds cuBLAS and CUTLASS.
  • Warning about benchmark differences due to input data distribution.
  • Performance metrics for different implementations compared to cuBLAS and CUTLASS.
  • Basic kernel achieves 37.62% TensorCore utilization, 59.4% of cuBLAS.
  • Warp specialization improves utilization to 45.47%, 71.7% of cuBLAS.
  • Tiled epilogue further increases utilization to 55.82%, 88.1% of cuBLAS.
  • Collective (2CTA) MMA reaches 59.41% utilization, 93.7% of cuBLAS.
  • Persistent kernel achieves 61.46% utilization, 97.0% of cuBLAS.
  • Dedicated epilogue warpgroup reaches 63.38% utilization, matching cuBLAS.
  • Grid tiling achieves 69.44% utilization, 109.6% of cuBLAS, exceeding it.
  • Final kernel implementation is less than 150 lines and reaches state-of-the-art performance.
  • Detailed explanation of each optimization step including code snippets.
  • Use of collective (2-CTA) MMAs to double arithmetic intensity (see the intensity sketch after this list).
  • Persistent kernels to amortize per-block initialization costs (sketched below).
  • A dedicated epilogue warpgroup to overlap compute and memory operations (sketched below).
  • Grid tiling to make better use of the L2 cache (sketched below).
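As a back-of-the-envelope sketch of the arithmetic-intensity argument (my formula and numbers, not the article's): for a B_M × B_N output tile accumulated over the full K dimension with b-byte inputs, the main loop's arithmetic intensity is

    \[
      \mathrm{AI} \;=\; \frac{\text{FLOPs}}{\text{bytes}}
      \;=\; \frac{2\,B_M B_N K}{b\,K\,(B_M + B_N)}
      \;=\; \frac{2\,B_M B_N}{b\,(B_M + B_N)} .
    \]

With bf16 inputs (b = 2) and a 128 × 128 tile this comes to 64 FLOPs per byte. Any scheme that halves the bytes each SM must pull in for the same FLOPs, such as a CTA pair sharing operand tiles instead of each CTA loading both operands in full, doubles that figure, which is the intuition behind the collective MMA.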
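A minimal sketch of the persistent-kernel idea (hypothetical names, simplified grid-stride scheduling; the article's kernel may distribute work differently): launch roughly one block per SM and have each block loop over output tiles, so per-block setup is paid once rather than once per tile.

    // Each block claims tiles in a grid-stride loop instead of exiting after one,
    // so shared-memory setup, barrier init, descriptor setup, etc. are amortized.
    __global__ void persistent_gemm_stub(int num_tiles_m, int num_tiles_n)
    {
        // ... one-time per-block initialization (smem buffers, barriers, TMA descriptors) ...
        int num_tiles = num_tiles_m * num_tiles_n;
        for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
            int tile_m = tile / num_tiles_n;   // row index of this output tile
            int tile_n = tile % num_tiles_n;   // column index of this output tile
            // ... main loop: stage A/B tiles for (tile_m, tile_n), issue MMAs ...
            // ... epilogue: write the finished C tile ...
            __syncthreads();                   // make smem safe to reuse for the next tile
        }
    }

    // Launch with about one block per SM so every block stays resident, e.g.:
    //   int num_sms; cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    //   persistent_gemm_stub<<<num_sms, 128>>>(M / 128, N / 128);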
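The dedicated epilogue warpgroup is a form of warp specialization: different warpgroups in the block take different roles so that loading upcoming tiles, issuing MMAs, and writing out finished results overlap instead of serializing. A structural sketch only (no real synchronization or tensor-core instructions; the article's kernel uses mbarriers, TMA, and tcgen05 MMAs for this):

    // Role split by warpgroup (4 warps = 128 threads per warpgroup; assumes a block
    // of at least 3 warpgroups). The producer keeps feeding shared memory, the MMA
    // warpgroup keeps the tensor cores busy, and the epilogue warpgroup drains
    // finished accumulators to global memory in parallel with the next tile's work.
    __global__ void warp_specialized_stub()
    {
        int warpgroup = threadIdx.x / 128;
        if (warpgroup == 0) {
            // producer: issue async loads of upcoming A/B tiles into shared memory,
            // signal a "stage full" barrier as each stage lands
        } else if (warpgroup == 1) {
            // MMA: wait for full stages, issue tensor-core MMAs, mark accumulators done
        } else {
            // epilogue: wait for finished accumulators and store the C tile to global
            // memory while the other warpgroups already work on the next output tile
        }
    }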
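Grid tiling remaps the linear tile index so that blocks scheduled around the same time cover a compact rectangle of output tiles and therefore hit the same A rows and B columns in L2. A common group-wise remapping in this spirit (hypothetical names; the article's exact scheme may differ):

    // Map a linear tile id to (tile_m, tile_n) so that up to group_m consecutive ids
    // share the same small set of A row-tiles, improving L2 reuse of A and B.
    // Assumes tile_id < num_tiles_m * num_tiles_n.
    __device__ void swizzled_tile_coords(int tile_id, int num_tiles_m, int num_tiles_n,
                                         int group_m, int* tile_m, int* tile_n)
    {
        int tiles_per_group = group_m * num_tiles_n;       // ids covering one row group
        int group           = tile_id / tiles_per_group;   // which row group we are in
        int first_m         = group * group_m;             // first row tile of the group
        int rows_in_group   = min(num_tiles_m - first_m, group_m);  // last group may be short
        int id_in_group     = tile_id % tiles_per_group;
        *tile_m = first_m + (id_in_group % rows_in_group);  // walk down the group's rows
        *tile_n = id_in_group / rows_in_group;              // then step across the columns
    }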