A gentle introduction to GEMM using MMA tensor cores
- #CUDA
- #GEMM
- #Tensor Cores
- Introduction to GEMM using MMA tensor cores, taking a reverse approach: starting from the smallest tile and building up to a full kernel.
- Tensor cores are fast matrix-multiply units inside each Streaming Multiprocessor (SM); keeping them busy requires feeding them data at high throughput.
- MMA (Matrix Multiply-Accumulate) instructions operate at warp level: all 32 threads of a warp cooperate to compute one small fixed-size GEMM.
- Explanation of the `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32` instruction and what each field encodes (tile shape, operand layouts, data types).
- Detailed thread-mapping for loading matrix fragments A, B, and C into registers.
- Introduction to `ldmatrix` for efficiently loading matrix fragments from shared memory into registers.
- Microkernel implementation using `ldmatrix` for better memory coalescing and performance.
- Extension to a full GEMM: larger matrices are handled by a grid of thread blocks, with each output tile accumulated over the K dimension.
- Benchmark comparison showing the initial kernel's performance vs. cuBLAS, highlighting optimization needs.
- Discussion on optimization principles focusing on data movement, reuse, and locality to improve performance.