Basic Facts about GPUs
- #GPU Architecture
- #Roofline Model
- #Performance Optimization
- GPUs have a large imbalance between compute speed and memory bandwidth: the NVIDIA A100, for example, delivers 19.5 TFLOPS of FP32 compute but only about 1.5 TB/s of memory bandwidth.
- The GPU memory hierarchy spans Global Memory (VRAM), Shared Memory (on-chip SRAM), and Registers, in increasing order of speed and decreasing order of capacity.
- Threads are grouped into Warps (32 threads) and Blocks, which run on Streaming Multiprocessors (SMs).
- Kernel performance is limited by either memory bandwidth or compute throughput, categorized as memory-bound or compute-bound operations.
- Arithmetic Intensity (AI) is the ratio of FLOPs to bytes accessed, determining whether a kernel is memory-bound or compute-bound.
- The Roofline Model visualizes performance limits, with the ridge point (e.g., ~13 FLOPs/Byte for A100) separating memory-bound and compute-bound regions.
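A minimal sketch of the roofline calculation, using the A100 figures quoted above (19.5 TFLOPS peak, 1.5 TB/s bandwidth); the vector-add intensity is an illustrative assumption of fp32 with two loads and one store per element:

```python
# Roofline: attainable throughput = min(peak compute, bandwidth * AI).
PEAK_FLOPS = 19.5e12  # A100 FP32 peak, from above
PEAK_BW = 1.5e12      # A100 memory bandwidth in bytes/s, from above

def attainable_flops(ai):
    """ai: arithmetic intensity in FLOPs/byte."""
    return min(PEAK_FLOPS, PEAK_BW * ai)

ridge = PEAK_FLOPS / PEAK_BW
print(f"ridge point: {ridge:.0f} FLOPs/byte")  # kernels right of this are compute-bound

# fp32 vector add: 1 FLOP per element, 12 bytes moved (2 loads + 1 store),
# so AI = 1/12 -- deep in the memory-bound region.
print(f"vector add tops out at {attainable_flops(1 / 12) / 1e9:.0f} GFLOPS")
```

No amount of compute tuning helps a kernel left of the ridge; only moving fewer bytes (or raising AI) does.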
- Optimization strategies include operator fusion to reduce memory traffic and increasing data reuse via Shared Memory.
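A back-of-the-envelope sketch of why fusion cuts memory traffic, counting global-memory bytes for a hypothetical elementwise chain `y = relu(a * x + b)` with scalar `a`, `b`:

```python
# Elementwise ops are memory-bound, so bytes moved ~ runtime.
N = 1 << 24  # number of fp32 elements (illustrative)
B = 4        # bytes per fp32

# Unfused: three kernels (mul, add, relu); each reads one array, writes one.
unfused = 3 * 2 * N * B
# Fused: a single kernel reads x once and writes y once.
fused = 2 * N * B

print(f"unfused moves {unfused / fused:.0f}x the bytes of the fused kernel")
```

The intermediate arrays never need to exist in Global Memory; the fused kernel keeps them in registers.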
- Matrix multiplication can be optimized by having the threads of a block cooperatively load tiles into Shared Memory, so each loaded element is reused for many multiply-adds, raising AI roughly in proportion to the tile size.
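A NumPy sketch of the tiling idea (the tile size `T` and shapes are illustrative; on a GPU the inner tiles would live in Shared Memory and the accumulator in registers):

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Blocked matmul: each TxT tile of A and B is loaded once and then
    reused for T multiply-adds per element, mirroring how a thread block
    stages tiles in Shared Memory to raise arithmetic intensity ~T-fold."""
    M, K = A.shape
    _, N = B.shape
    assert M % T == 0 and N % T == 0 and K % T == 0
    C = np.zeros((M, N))
    for i in range(0, M, T):
        for j in range(0, N, T):
            acc = np.zeros((T, T))  # per-tile accumulator
            for k in range(0, K, T):
                acc += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]  # tiles in fast memory
            C[i:i+T, j:j+T] = acc
    return C

A, B = np.random.rand(64, 96), np.random.rand(96, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

Without tiling, naive matmul reads each input element from Global Memory once per output it contributes to; tiling amortizes that read across the whole tile.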
- Memory access patterns must be coalesced for efficiency, and Shared Memory must avoid bank conflicts.
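A small checker for the bank-conflict rule, assuming the common NVIDIA layout of 32 banks of 4-byte words (this sketch ignores the broadcast case where several threads read the identical address):

```python
from collections import Counter

def max_bank_conflicts(addresses, n_banks=32, word=4):
    """Worst-case serialization factor for one warp's Shared Memory accesses:
    the most threads landing in any single bank at distinct addresses."""
    banks = Counter((a // word) % n_banks for a in addresses)
    return max(banks.values())

warp = range(32)
# Stride-1 fp32 access: each thread hits a distinct bank -> conflict-free.
print(max_bank_conflicts([4 * t for t in warp]))        # serialization factor 1
# Stride-32 words (column access in a 32-wide fp32 tile): all hit bank 0.
print(max_bank_conflicts([4 * 32 * t for t in warp]))   # serialization factor 32
```

The stride-32 case is why tiled kernels often pad Shared Memory tiles to 33 columns: it shifts each row into a different bank.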
- Thread divergence within a warp serializes the divergent paths, while high occupancy gives the SM scheduler enough resident warps to hide memory latency.
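A rough occupancy estimate: resident blocks per SM are capped by whichever resource runs out first (the per-SM limits below roughly match an A100 and are assumptions of this sketch, as are the kernel's resource figures):

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  sm_regs=65536, sm_smem=164 * 1024,
                  sm_threads=2048, sm_blocks=32):
    """Occupancy is bounded by the scarcest resource: threads, registers,
    Shared Memory, or the hard block limit."""
    limits = [
        sm_threads // threads_per_block,
        sm_regs // (regs_per_thread * threads_per_block),
        sm_smem // smem_per_block if smem_per_block else sm_blocks,
        sm_blocks,
    ]
    return min(limits)

# A register-heavy kernel (64 regs/thread): occupancy capped by the register file.
b = blocks_per_sm(threads_per_block=256, regs_per_thread=64, smem_per_block=8 * 1024)
print(b, "blocks/SM =", b * 256 / 2048, "occupancy")
```

Here the register file, not Shared Memory, is the binding constraint; trimming registers per thread would admit more warps to hide latency with.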
- Quantization can improve performance by reducing memory usage and enabling faster low-precision operations.
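For a memory-bound kernel, runtime scales with bytes moved, so lower precision translates almost directly into speedup. A sketch using the A100 bandwidth figure from above (the 7B parameter count is a hypothetical model size):

```python
PEAK_BW = 1.5e12  # bytes/s, A100 figure from above

def load_time(n_params, bytes_per_param, bw=PEAK_BW):
    """Lower bound on time to stream all weights once from Global Memory."""
    return n_params * bytes_per_param / bw

n = 7e9  # hypothetical 7B-parameter model
for name, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {n * width / 1e9:.0f} GB, "
          f"{load_time(n, width) * 1e3:.1f} ms per full weight pass")
```

This is only the bandwidth side; quantization can also raise compute throughput where the hardware has faster low-precision units (e.g. Tensor Cores).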