Writing Speed-of-Light Flash Attention for 5090 in CUDA C++
- #CUDA
- #GPU
- #Optimization
- The blog post details an implementation of Flash Attention for NVIDIA's RTX 5090 GPU, written in CUDA C++.
- The author's goal is to learn CUDA C++ by implementing attention, since higher-level tools like Triton do not expose all of the features needed.
- Benchmarks show the implementation reaching up to 94.39% of the 5090's theoretical speed-of-light (SOL) throughput (a worked estimate of the SOL bound follows this list).
- The post covers five versions of the kernel, each introducing optimizations such as shared memory swizzling and pipelining.
- Key optimizations include reducing shared memory bank conflicts, overlapping memory transfers with compute, and efficient use of Tensor Cores (illustrative sketches of swizzling and async pipelining follow this list).
- The author compares the implementation against PyTorch's Flash Attention and cuDNN, noting competitive performance.
- Suggested future work includes implementing the backward pass, quantized attention, and use of the Tensor Memory Accelerator (TMA).
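
As a rough illustration of where a speed-of-light figure comes from (the symbols and the peak-throughput number below are assumptions for illustration, not taken from the post): non-causal attention is dominated by two matrix multiplications per head, so dividing its FLOP count by the GPU's peak dense BF16 Tensor Core throughput gives a lower bound on runtime.

```latex
% Hypothetical SOL estimate for batch B, heads H, sequence length S,
% head dim D, assuming a peak of roughly 209.5 dense BF16 TFLOPS.
\[
\begin{aligned}
\text{FLOPs} &\approx 4\,B\,H\,S^{2}\,D
  && \text{(two matmuls: } QK^{\top} \text{ and } PV\text{)} \\
T_{\text{SOL}} &= \frac{\text{FLOPs}}{209.5\times 10^{12}\ \text{FLOP/s}} \\
\%\,\text{SOL} &= \frac{T_{\text{SOL}}}{T_{\text{measured}}}\times 100
\end{aligned}
\]
```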
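The post's actual kernels are in the author's code; as a generic, hedged sketch of XOR-based shared memory swizzling (not the author's implementation), the toy transpose below XORs the column index with the row index so that a warp reading a "column" of the tile touches 32 distinct banks instead of hammering one.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative only: a 32x32 tile of floats in shared memory.
// Without swizzling, a warp reading a column hits a single bank.
// XOR-ing the column index with the row index permutes elements
// within each row so the 32 accesses land in 32 different banks.
__global__ void transpose_swizzled(const float* in, float* out) {
    __shared__ float tile[32][32];

    int row = threadIdx.y;
    int col = threadIdx.x;

    // Swizzled store: element (row, col) goes to column col ^ row.
    tile[row][col ^ row] = in[row * 32 + col];
    __syncthreads();

    // Swizzled load of the transposed element (col, row): it was
    // stored at column row ^ col, so the same XOR recovers it, and
    // each thread in the warp reads from a distinct bank.
    out[row * 32 + col] = tile[col][row ^ col];
}

int main() {
    float h_in[32 * 32], h_out[32 * 32];
    for (int i = 0; i < 32 * 32; ++i) h_in[i] = static_cast<float>(i);

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    transpose_swizzled<<<1, dim3(32, 32)>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("out[1] = %.0f (expect 32)\n", h_out[1]);  // spot-check
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Padding the tile to 33 columns removes the same conflicts, but the XOR trick keeps the shared memory footprint a power of two, which matters when shared memory is what limits occupancy.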
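Overlapping global-to-shared transfers with compute is typically done with `cp.async` double buffering. The sketch below is a generic tiled reduction (not the attention kernel): it prefetches the next tile while the current one is consumed, using the `<cuda_pipeline.h>` primitives available on SM 8.0 and newer.

```cuda
#include <cuda_runtime.h>
#include <cuda_pipeline.h>
#include <cstdio>

constexpr int TILE = 256;

// Illustrative double-buffered reduction: while the current tile is
// summed, cp.async copies the next tile from global to shared memory,
// hiding the load latency behind compute.
__global__ void tiled_sum(const float* __restrict__ in, float* out, int n_tiles) {
    __shared__ float buf[2][TILE];
    int tid = threadIdx.x;

    // Prefetch tile 0 into buffer 0 (4 bytes per thread).
    __pipeline_memcpy_async(&buf[0][tid], &in[tid], sizeof(float));
    __pipeline_commit();

    float acc = 0.0f;
    for (int t = 0; t < n_tiles; ++t) {
        int cur = t & 1;
        int nxt = cur ^ 1;

        // Kick off the copy of the next tile before consuming this one.
        if (t + 1 < n_tiles) {
            __pipeline_memcpy_async(&buf[nxt][tid],
                                    &in[(t + 1) * TILE + tid],
                                    sizeof(float));
        }
        __pipeline_commit();

        // Wait only for the current tile's copy; the next tile's copy
        // stays in flight while we compute.
        __pipeline_wait_prior(1);
        // Not strictly needed here (each thread reads the element it
        // copied), but required whenever threads consume data fetched
        // by other threads.
        __syncthreads();

        acc += buf[cur][tid];   // "compute" on the current tile
        __syncthreads();        // tile fully consumed before reuse
    }

    atomicAdd(out, acc);
}

int main() {
    const int n_tiles = 8, n = n_tiles * TILE;
    float* h_in = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    tiled_sum<<<1, TILE>>>(d_in, d_out, n_tiles);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expect %d)\n", h_out, n);

    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h_in;
    return 0;
}
```

In a real Flash Attention kernel the "compute" stage would be the Tensor Core matmuls and online softmax for the current K/V tile, but the structure of issuing the next copy before waiting on the current one is the same. Compile with an SM 8.0+ architecture flag (e.g. `nvcc -arch=sm_80`).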