Hasty Briefs

Basic Facts about GPUs

10 months ago
  • #GPU Architecture
  • #Roofline Model
  • #Performance Optimization
  • GPUs have a significant imbalance between compute speed and memory bandwidth: the NVIDIA A100 delivers 19.5 TFLOPS of FP32 compute but only about 1.5 TB/s of memory bandwidth.
  • The GPU memory hierarchy includes Global Memory (off-chip VRAM), Shared Memory (on-chip SRAM), and Registers, trading capacity for speed at each level.
  • Threads are grouped into Warps (32 threads) and Blocks, which run on Streaming Multiprocessors (SMs).
  • Kernel performance is limited by either memory bandwidth or compute throughput, categorized as memory-bound or compute-bound operations.
  • Arithmetic Intensity (AI) is the ratio of FLOPs to bytes accessed, determining whether a kernel is memory-bound or compute-bound.
  • The Roofline Model visualizes performance limits, with the ridge point (e.g., ~13 FLOPs/Byte for A100) separating memory-bound and compute-bound regions.
  • Optimization strategies include operator fusion to reduce memory traffic and increasing data reuse via Shared Memory.
  • Matrix multiplication can be optimized by loading tiles into Shared Memory and using cooperative thread strategies to increase AI.
  • Memory accesses must be coalesced for efficiency, and Shared Memory accesses must avoid bank conflicts.
  • Thread divergence and occupancy are critical factors affecting performance, with high occupancy helping hide latency.
  • Quantization can improve performance by reducing memory usage and enabling faster low-precision operations.
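
The arithmetic-intensity and ridge-point bullets above can be made concrete with a few lines of arithmetic. A minimal roofline sketch, assuming the A100 figures quoted in the list (19.5 TFLOPS FP32 peak, 1.5 TB/s bandwidth; actual numbers vary by SKU):

```python
PEAK_FLOPS = 19.5e12   # FLOP/s, FP32 peak (assumed A100 figure from above)
BANDWIDTH = 1.5e12     # bytes/s, HBM bandwidth (assumed A100 figure from above)

def attainable(ai):
    """Attainable FLOP/s for a kernel with arithmetic intensity `ai` (FLOPs/byte)."""
    return min(PEAK_FLOPS, ai * BANDWIDTH)

ridge = PEAK_FLOPS / BANDWIDTH   # the ridge point separating the two regimes
print(f"ridge point: {ridge:.1f} FLOPs/byte")  # ~13, as quoted above

# FP32 vector add: 1 FLOP per 12 bytes (two loads, one store) -> memory-bound
print(f"vector add: {attainable(1 / 12) / 1e12:.3f} TFLOP/s")

# Large FP32 matmul: 2*N^3 FLOPs over 3*N^2*4 bytes -> AI = N/6 -> compute-bound
N = 4096
ai_mm = (2 * N**3) / (3 * N**2 * 4)
print(f"{N}x{N} matmul: AI={ai_mm:.0f}, {attainable(ai_mm) / 1e12:.1f} TFLOP/s")
```

Any kernel whose AI falls left of the ridge point is capped by the sloped bandwidth roof; anything to the right is capped by the flat compute roof.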
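
The operator-fusion bullet can be checked with back-of-envelope traffic counting. A sketch for fusing two elementwise ops, y = relu(x + b), on n FP32 elements (the array size is an illustrative assumption):

```python
n, fp32 = 1 << 20, 4  # element count (assumed) and bytes per FP32 value

# Unfused: the add kernel writes a temporary to Global Memory,
# and the relu kernel reads it back.
unfused = (2 * n + n) * fp32 + (n + n) * fp32  # add: 2 loads + 1 store; relu: 1 load + 1 store

# Fused: one kernel reads x and b and writes y directly; the
# intermediate never leaves registers.
fused = (2 * n + n) * fp32

print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB, "
      f"saved: {1 - fused / unfused:.0%}")
```

Since both kernels are far below the ridge point, the 40% reduction in bytes moved translates almost directly into a 40% reduction in runtime.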
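
The tiled-matmul bullet can be sketched in pure Python, with each T×T tile playing the role of a tile staged in Shared Memory (the tile size and loop structure are a stand-in for a CUDA kernel, not actual GPU code):

```python
T = 2  # tile edge; on a GPU this would match the thread-block tile size

def matmul_tiled(A, B, n):
    """n x n matmul with T x T tiling; assumes T divides n."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):
                # On a GPU, the A and B tiles for this step would be loaded
                # into Shared Memory once, then each element reused T times,
                # raising arithmetic intensity roughly in proportion to T.
                for i in range(i0, i0 + T):
                    for j in range(j0, j0 + T):
                        acc = C[i][j]
                        for k in range(k0, k0 + T):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

A = [[float(i * 4 + j + 1) for j in range(4)] for i in range(4)]
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
assert matmul_tiled(A, I4, 4) == A  # multiplying by the identity returns A
```

The cooperative part on a real GPU is that all threads in a block load the tile together, then all compute from it, with barriers between the load and compute phases.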
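
The coalescing bullet can be illustrated with a simple counting model: how many memory segments one 32-thread warp touches for FP32 loads at different strides (the 32-byte segment size is an assumption matching common GPU transaction granularity, and the warp's base address is assumed aligned to a segment):

```python
def segments(stride_elems, warp=32, elem=4, seg=32):
    """Count the distinct `seg`-byte segments a warp's FP32 loads touch."""
    addrs = {(t * stride_elems * elem) // seg for t in range(warp)}
    return len(addrs)

print(segments(1))   # coalesced: 4 segments cover 128 contiguous bytes
print(segments(32))  # strided: 32 segments, 8x the traffic for the same data
```

With unit stride the warp's 128 bytes of useful data arrive in 4 transactions; with a stride of 32 elements, every thread lands in its own segment and most of each transaction is wasted.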
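
The memory-usage side of the quantization bullet is easy to quantify. A rough estimate of how weight precision affects a memory-bound kernel that streams all weights once, assuming an illustrative 7B-parameter model and the 1.5 TB/s bandwidth figure from above:

```python
params = 7e9          # weight count (illustrative assumption)
bandwidth = 1.5e12    # bytes/s, the A100 figure quoted above (assumed)

for name, bytes_per_w in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gb = params * bytes_per_w / 1e9
    # Lower bound on the time to stream every weight once from Global Memory
    t = params * bytes_per_w / bandwidth
    print(f"{name}: {gb:4.1f} GB of weights, streamed in {t * 1e3:.1f} ms")
```

Halving the bytes per weight halves both the footprint and the streaming time, which is why quantization pays off most on memory-bound workloads even before any low-precision compute speedup.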