
Writing Speed-of-Light Flash Attention for 5090 in CUDA C++

a day ago
  • #CUDA
  • #GPU
  • #Optimization
  • The blog post details the implementation of Flash Attention for NVIDIA's 5090 GPU using CUDA C++.
  • The author's goal is to learn CUDA C++ by implementing attention mechanisms, as existing tools like Triton lack certain features.
  • Performance benchmarks show the author's implementation reaching up to 94.39% of the theoretical speed-of-light (SOL) for the 5090; a sketch of how such an SOL figure is derived appears after this list.
  • The post walks through five versions of the kernel, each adding optimizations such as shared memory swizzling and pipelining.
  • Key optimizations include reducing shared memory bank conflicts, overlapping memory transfers with compute, and efficient use of Tensor Cores; illustrative sketches of these techniques follow this list.
  • The author compares their implementation against PyTorch's Flash Attention and cuDNN, noting competitive performance.
  • Suggested future work includes implementing the backward pass, quantized attention, and use of the Tensor Memory Accelerator (TMA).
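
A note on the SOL figure above: "speed of light" for a compute-bound attention kernel is usually the matmul FLOP count divided by the GPU's peak Tensor Core throughput. The sketch below works that arithmetic through; the problem shape and the ~209.5 TFLOPS dense BF16 peak assumed for the 5090 are illustrative values, not numbers taken from the post.

```cuda
#include <cstdio>

int main() {
    // Hypothetical problem shape: batch, heads, sequence length, head dim.
    const double B = 1, H = 32, S = 4096, D = 128;
    // A forward attention pass does two matmuls per head, Q*K^T and P*V,
    // each costing 2*S*S*D FLOPs, so ~4*B*H*S^2*D total (softmax ignored).
    const double flops = 4.0 * B * H * S * S * D;
    const double peak_tflops = 209.5;  // assumed dense BF16 peak, not from the post
    const double sol_us = flops / (peak_tflops * 1e12) * 1e6;
    printf("SOL time: %.1f us\n", sol_us);
    // Percent of SOL achieved = SOL time / measured kernel time * 100.
    return 0;
}
```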
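
On bank conflicts and swizzling: the post's actual layouts are not reproduced here, but the XOR-swizzle idea it refers to generally looks like the sketch below. The 64-element-wide bf16 tile and the specific swizzle pattern are assumptions for illustration.

```cuda
#include <cuda_bf16.h>

// Map a (row, col) element of a 64-wide bf16 tile to a swizzled shared memory
// offset. Shared memory has 32 four-byte banks; without swizzling, threads
// reading one column across consecutive 128-byte rows all hit the same bank.
// XOR-ing the 16-byte chunk index with low row bits spreads those accesses
// across banks while keeping each row's data within the same row.
__device__ __forceinline__ int swizzle_offset(int row, int col) {
    int chunk = col / 8;               // 8 bf16 = one 16-byte chunk
    int swizzled = chunk ^ (row % 8);  // stays inside the 64-element row
    return row * 64 + swizzled * 8 + (col % 8);
}

__global__ void stage_tile_swizzled(const __nv_bfloat16* __restrict__ src) {
    __shared__ __nv_bfloat16 tile[64 * 64];
    // One element per thread per step; real kernels use 16-byte vector
    // copies, but the indexing idea is the same.
    for (int i = threadIdx.x; i < 64 * 64; i += blockDim.x) {
        tile[swizzle_offset(i / 64, i % 64)] = src[i];
    }
    __syncthreads();
    // ... subsequent reads of tile[] would go through swizzle_offset() too.
}
```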
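
For overlapping memory transfers with compute, a common building block is a double-buffered loop over the CUDA async-copy primitives, sketched below. The post's kernels may use a different pipeline depth or raw PTX; the tile size and the trivial reduction standing in for the real compute are placeholders.

```cuda
#include <cuda_pipeline.h>

#define TILE_ELEMS 4096  // illustrative tile size in floats

__global__ void pipelined_sum(const float* __restrict__ gmem, float* out, int ntiles) {
    __shared__ float buf[2][TILE_ELEMS];  // double buffer
    float acc = 0.f;

    // Prefetch tile 0 into buffer 0 before the main loop.
    for (int i = threadIdx.x; i < TILE_ELEMS; i += blockDim.x)
        __pipeline_memcpy_async(&buf[0][i], &gmem[i], sizeof(float));
    __pipeline_commit();

    for (int t = 0; t < ntiles; ++t) {
        int cur = t & 1, nxt = (t + 1) & 1;
        // Kick off the copy of the next tile while the current one is consumed.
        if (t + 1 < ntiles) {
            const float* src = gmem + (t + 1) * TILE_ELEMS;
            for (int i = threadIdx.x; i < TILE_ELEMS; i += blockDim.x)
                __pipeline_memcpy_async(&buf[nxt][i], &src[i], sizeof(float));
        }
        __pipeline_commit();
        __pipeline_wait_prior(1);  // wait for the current tile, not the new copy
        __syncthreads();
        for (int i = threadIdx.x; i < TILE_ELEMS; i += blockDim.x)
            acc += buf[cur][i];    // stand-in for the real compute
        __syncthreads();           // all threads done with buf[cur] before reuse
    }
    if (threadIdx.x == 0) atomicAdd(out, acc);
}
```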
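
Finally, Tensor Core usage is only gestured at here via the portable WMMA API; the post may well drive the hardware through lower-level mma instructions instead. The sketch shows one warp issuing a single 16x16x16 matrix multiply-accumulate.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a 16x16 half tile A by a 16x16 half tile B and
// accumulates into a 16x16 float tile C on the Tensor Cores.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);  // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```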