Hasty Briefs (beta)

Learning about GPUs through measuring memory bandwidth

4 days ago
  • #Microbenchmark
  • #GPU
  • #Memory Bandwidth
  • Traverse Research uses microbenchmarks to measure GPU memory bandwidth for insights into hardware performance.
  • GPU memory access is complex, involving descriptors, different buffer types (Byte Address, Structured, Typed), and texture units.
  • The GPU memory hierarchy has multiple cache levels (L0, L1, L2, etc.) that trade capacity against speed: the levels closest to the compute units are the smallest and fastest.
  • Caches use write-back, write-through, or write-around strategies; write-back is common because it allows writes to be combined.
  • GPUs hide memory latency by keeping more threads in flight, but too many threads in flight can cause cache thrashing.
  • Microbenchmark design involves reading/writing data in loops, avoiding cache hits, and testing different element sizes.
  • Findings include performance differences between textures and buffers, and bottlenecks with specific data types.
  • Qualcomm Adreno 740 (Meta Quest 3) shows significant bandwidth differences between buffers and textures in main memory.
  • AMD Radeon RX 9070 XT has a fast L0 cache (20 TiB/s for buffers, 11 TiB/s for textures) and hits ALU bottlenecks with integer operations.
  • Intel Arc B580 shows varying bandwidth for different data types when loading from textures vs buffers.
  • NVIDIA GeForce RTX 5070 Ti bottlenecks when many writes go to the same memory, and its bandwidth differs between uint1 and float1 loads.
  • Microbenchmarks reveal hardware-specific optimizations and mysteries, guiding performance tuning for specific GPUs.
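
The measurement idea behind the microbenchmark bullets above (move a known number of bytes in a loop, time it, and repeat for different working-set sizes) can be sketched in a few lines. This is only a CPU-side illustration in Python, not the shader code Traverse Research used: the function names `bandwidth_gib_per_s`, `working_set_sizes`, and `measure_copy` are hypothetical, and a plain buffer copy stands in for a GPU dispatch.

```python
import time


def bandwidth_gib_per_s(bytes_moved: int, seconds: float) -> float:
    """Effective bandwidth in GiB/s (binary units) for a timed transfer."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 2**30


def working_set_sizes(min_bytes: int, max_bytes: int) -> list[int]:
    """Power-of-two working-set sizes from min_bytes up to max_bytes inclusive.

    Sweeping the size is one way to make the data fit in (or spill out of)
    successive cache levels.
    """
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def measure_copy(size_bytes: int, iterations: int = 8, repeats: int = 5) -> float:
    """Best-of-`repeats` bandwidth for copying a buffer `iterations` times.

    Assumption: a CPU memory copy is used as a stand-in for a GPU kernel;
    the numbers say nothing about any GPU.
    """
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(iterations):
            dst = bytes(src)  # reads size_bytes and writes size_bytes
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer.
    best = max(best, 1e-9)
    # Each copy moves size_bytes in and size_bytes out.
    return bandwidth_gib_per_s(2 * size_bytes * iterations, best)


if __name__ == "__main__":
    for size in working_set_sizes(64 * 1024, 64 * 1024 * 1024):
        print(f"{size // 1024:>8} KiB: {measure_copy(size):8.2f} GiB/s")
```

Taking the best of several repeats reduces noise from the first, cold run. On a real GPU the same structure would apply, with the copy replaced by a timed dispatch and the element size varied as well as the working-set size.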