A gentle introduction to GEMM using MMA tensor cores
- #CUDA
- #GEMM
- #Tensor Cores
- Introduction to GEMM using MMA tensor cores, taking a reverse approach: starting from the smallest tile and building up to a full kernel.
- Tensor cores are fast matrix-multiply units inside each Streaming Multiprocessor (SM); keeping them busy requires feeding them data at high throughput.
- MMA (Matrix Multiply-Accumulate) instructions operate at warp level: all 32 threads of a warp cooperate to compute one small fixed-size GEMM.
- Explanation of the `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32` instruction and what each field encodes (tile shape, operand layouts, data types).
- Detailed thread-mapping for loading matrix fragments A, B, and C into registers.
- Introduction to `ldmatrix` for efficiently loading matrix fragments from shared memory into registers.
- Microkernel implementation using `ldmatrix` for better memory coalescing and performance.
- Extension to a full GEMM: larger matrices are handled by a grid of thread blocks, with each output tile accumulated over the K dimension.
- Benchmark comparison showing the initial kernel's performance vs. cuBLAS, highlighting optimization needs.
- Discussion on optimization principles focusing on data movement, reuse, and locality to improve performance.