Basic Facts about GPUs
- #GPU Architecture
- #Roofline Model
- #Performance Optimization
- GPUs have a large imbalance between compute speed and memory bandwidth: the NVIDIA A100, for example, delivers 19.5 TFLOPS of FP32 compute but only about 1.5 TB/s of memory bandwidth.
- The GPU memory hierarchy spans Global Memory (VRAM), Shared Memory (on-chip SRAM), and Registers, in increasing order of speed and decreasing order of capacity.
- Threads are grouped into Warps (32 threads) and Blocks, which run on Streaming Multiprocessors (SMs).
- Kernel performance is limited by either memory bandwidth or compute throughput, categorized as memory-bound or compute-bound operations.
- Arithmetic Intensity (AI) is the ratio of FLOPs to bytes accessed, determining whether a kernel is memory-bound or compute-bound.
- The Roofline Model visualizes performance limits, with the ridge point (e.g., ~13 FLOPs/Byte for A100) separating memory-bound and compute-bound regions.
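A minimal sketch of the roofline calculation, using the A100 figures quoted above (19.5 TFLOPS peak, 1.5 TB/s bandwidth); the vector-add intensity is an illustrative assumption of fp32 with two loads and one store per element:

```python
# Roofline: attainable throughput = min(peak compute, bandwidth * AI).
PEAK_FLOPS = 19.5e12  # A100 FP32 peak, from above
PEAK_BW = 1.5e12      # A100 memory bandwidth in bytes/s, from above

def attainable_flops(ai):
    """ai: arithmetic intensity in FLOPs/byte."""
    return min(PEAK_FLOPS, PEAK_BW * ai)

ridge = PEAK_FLOPS / PEAK_BW
print(f"ridge point: {ridge:.0f} FLOPs/byte")  # kernels right of this are compute-bound

# fp32 vector add: 1 FLOP per element, 12 bytes moved (2 loads + 1 store),
# so AI = 1/12 -- deep in the memory-bound region.
print(f"vector add tops out at {attainable_flops(1 / 12) / 1e9:.0f} GFLOPS")
```

No amount of compute tuning helps a kernel left of the ridge; only moving fewer bytes (or raising AI) does.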
- Optimization strategies include operator fusion to reduce memory traffic and increasing data reuse via Shared Memory.
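A back-of-the-envelope sketch of why fusion cuts memory traffic, counting global-memory bytes for a hypothetical elementwise chain `y = relu(a * x + b)` with scalar `a`, `b`:

```python
# Elementwise ops are memory-bound, so bytes moved ~ runtime.
N = 1 << 24  # number of fp32 elements (illustrative)
B = 4        # bytes per fp32

# Unfused: three kernels (mul, add, relu); each reads one array, writes one.
unfused = 3 * 2 * N * B
# Fused: a single kernel reads x once and writes y once.
fused = 2 * N * B

print(f"unfused moves {unfused / fused:.0f}x the bytes of the fused kernel")
```

The intermediate arrays never need to exist in Global Memory; the fused kernel keeps them in registers.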
- Matrix multiplication can be optimized by having the threads of a block cooperatively load tiles into Shared Memory, so each loaded element is reused for many multiply-adds, raising AI roughly in proportion to the tile size.
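A NumPy sketch of the tiling idea (the tile size `T` and shapes are illustrative; on a GPU the inner tiles would live in Shared Memory and the accumulator in registers):

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Blocked matmul: each TxT tile of A and B is loaded once and then
    reused for T multiply-adds per element, mirroring how a thread block
    stages tiles in Shared Memory to raise arithmetic intensity ~T-fold."""
    M, K = A.shape
    _, N = B.shape
    assert M % T == 0 and N % T == 0 and K % T == 0
    C = np.zeros((M, N))
    for i in range(0, M, T):
        for j in range(0, N, T):
            acc = np.zeros((T, T))  # per-tile accumulator
            for k in range(0, K, T):
                acc += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]  # tiles in fast memory
            C[i:i+T, j:j+T] = acc
    return C

A, B = np.random.rand(64, 96), np.random.rand(96, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

Without tiling, naive matmul reads each input element from Global Memory once per output it contributes to; tiling amortizes that read across the whole tile.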
- Memory access patterns must be coalesced for efficiency, and Shared Memory must avoid bank conflicts.
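A small checker for the bank-conflict rule, assuming the common NVIDIA layout of 32 banks of 4-byte words (this sketch ignores the broadcast case where several threads read the identical address):

```python
from collections import Counter

def max_bank_conflicts(addresses, n_banks=32, word=4):
    """Worst-case serialization factor for one warp's Shared Memory accesses:
    the most threads landing in any single bank at distinct addresses."""
    banks = Counter((a // word) % n_banks for a in addresses)
    return max(banks.values())

warp = range(32)
# Stride-1 fp32 access: each thread hits a distinct bank -> conflict-free.
print(max_bank_conflicts([4 * t for t in warp]))        # serialization factor 1
# Stride-32 words (column access in a 32-wide fp32 tile): all hit bank 0.
print(max_bank_conflicts([4 * 32 * t for t in warp]))   # serialization factor 32
```

The stride-32 case is why tiled kernels often pad Shared Memory tiles to 33 columns: it shifts each row into a different bank.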
- Thread divergence within a warp serializes the divergent paths, while high occupancy gives the SM scheduler enough resident warps to hide memory latency.
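A rough occupancy estimate: resident blocks per SM are capped by whichever resource runs out first (the per-SM limits below roughly match an A100 and are assumptions of this sketch, as are the kernel's resource figures):

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  sm_regs=65536, sm_smem=164 * 1024,
                  sm_threads=2048, sm_blocks=32):
    """Occupancy is bounded by the scarcest resource: threads, registers,
    Shared Memory, or the hard block limit."""
    limits = [
        sm_threads // threads_per_block,
        sm_regs // (regs_per_thread * threads_per_block),
        sm_smem // smem_per_block if smem_per_block else sm_blocks,
        sm_blocks,
    ]
    return min(limits)

# A register-heavy kernel (64 regs/thread): occupancy capped by the register file.
b = blocks_per_sm(threads_per_block=256, regs_per_thread=64, smem_per_block=8 * 1024)
print(b, "blocks/SM =", b * 256 / 2048, "occupancy")
```

Here the register file, not Shared Memory, is the binding constraint; trimming registers per thread would admit more warps to hide latency with.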
- Quantization can improve performance by reducing memory usage and enabling faster low-precision operations.
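For a memory-bound kernel, runtime scales with bytes moved, so lower precision translates almost directly into speedup. A sketch using the A100 bandwidth figure from above (the 7B parameter count is a hypothetical model size):

```python
PEAK_BW = 1.5e12  # bytes/s, A100 figure from above

def load_time(n_params, bytes_per_param, bw=PEAK_BW):
    """Lower bound on time to stream all weights once from Global Memory."""
    return n_params * bytes_per_param / bw

n = 7e9  # hypothetical 7B-parameter model
for name, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {n * width / 1e9:.0f} GB, "
          f"{load_time(n, width) * 1e3:.1f} ms per full weight pass")
```

This is only the bandwidth side; quantization can also raise compute throughput where the hardware has faster low-precision units (e.g. Tensor Cores).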