The math behind tiled vs. naive matrix multiplication in CUDA
- #deep learning
- #matrix multiplication
- #optimization
- Matrix multiplication can be optimized with 'tiling', which improves how efficiently the GPU's power, memory, and compute resources are used.
- Tiling reduces latency by cutting the number of global memory accesses, which matters for models such as transformers that are dominated by dense matrix multiplications.
- The technique stages tiles of the input matrices in fast on-chip memory and reuses their rows and columns across threads, cutting total global memory accesses by a factor of the block size (see the worked calculation after this list).
- Parallelization and better memory management are the key benefits: each thread block cooperatively loads and reuses one tile, making the multiplication faster (see the kernel sketch after this list).
- Hardware shared-memory capacity limits the block size, but strategies like partial fetching (loading a tile in stages) help mitigate these limits.
- Tiling's effectiveness is quantified by the reduction factor in global memory accesses, which scales linearly with the block size.
- The post concludes with practical considerations for implementing tiling, including memory subsystem hierarchies and block size trade-offs.
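
To make the reduction factor concrete, here is the standard back-of-the-envelope count of global memory reads, assuming square N × N matrices and a tile (block) width T that divides N; the post's own notation may differ:

```latex
\begin{align*}
\text{Naive reads:} \quad & N^2 \cdot 2N = 2N^3
  && \text{(each output element reads a row of } A \text{ and a column of } B\text{)} \\
\text{Tiled reads:} \quad & \frac{2N^3}{T}
  && \text{(each input element is fetched } N/T \text{ times instead of } N \text{ times)} \\
\text{Reduction factor:} \quad & \frac{2N^3}{\,2N^3/T\,} = T
\end{align*}
```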
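
And a minimal CUDA sketch contrasting the two kernels. This is not the post's code: the names `matmulNaive`, `matmulTiled`, and `TILE_WIDTH` are illustrative, and it assumes row-major N × N float matrices with N a multiple of `TILE_WIDTH`:

```cuda
#define TILE_WIDTH 16  // illustrative tile edge; capped by shared-memory size

// Naive: each thread reads 2*N floats from global memory for one output.
__global__ void matmulNaive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Tiled: each block stages one tile of A and one tile of B in shared memory
// per phase, so every global element is read N/TILE_WIDTH times instead of N.
__global__ void matmulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE_WIDTH; ++t) {
        // Each thread fetches exactly one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with this tile before it is overwritten
    }
    C[row * N + col] = acc;
}
```

Launched with `dim3 block(TILE_WIDTH, TILE_WIDTH)` and `dim3 grid(N / TILE_WIDTH, N / TILE_WIDTH)`, each block computes one TILE_WIDTH × TILE_WIDTH tile of C; the two `__syncthreads()` barriers are what make the shared-memory reuse safe.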