Learning about GPUs through measuring memory bandwidth

4 days ago

Traverse Research uses microbenchmarks to measure GPU memory bandwidth for insights into hardware performance.
GPU memory access is complex, involving descriptors, different buffer types (Byte Address, Structured, Typed), and texture units.
GPU memory hierarchies use multiple cache levels (L0, L1, L2, and so on) to trade capacity against speed: the smaller caches sit closer to the compute units and are faster.
Caches use write-back, write-through, or write-around strategies, with write-back being common for write combining.
GPUs hide latency by keeping more threads in flight, but excessive threads can cause cache thrashing.
Microbenchmark design involves reading or writing data in a loop, controlling whether accesses hit a given cache level, and testing different element sizes.
Findings include performance differences between textures and buffers, and bottlenecks with specific data types.
Qualcomm Adreno 740 (Meta Quest 3) shows significant bandwidth differences between buffers and textures in main memory.
AMD Radeon RX 9070 XT has fast L0 cache (20 TiB/s for buffers, 11 TiB/s for textures) and ALU bottlenecks with integer operations.
Intel Arc B580 shows varying bandwidth for different data types when loading from textures vs buffers.
NVIDIA GeForce RTX 5070 Ti experiences bottlenecks with many writes to the same memory and differences in uint1 vs float1 loads.
Microbenchmarks reveal hardware-specific optimizations and mysteries, guiding performance tuning for specific GPUs.
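
The microbenchmark structure described above (read a buffer in a loop, vary the working set so it fits or spills out of each cache level, then divide bytes moved by elapsed time) can be sketched as follows. This is a minimal CPU-side illustration in Python, not the GPU shader code the article measures with. The function names, the power-of-two sweep, and the default sizes are all assumptions made for the example, and interpreter overhead will dominate at small working sets.

```python
import time


def working_set_sizes(min_bytes, max_bytes):
    """Power-of-two sweep from min_bytes up to max_bytes (hypothetical helper)."""
    if min_bytes <= 0:
        raise ValueError("min_bytes must be positive")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def bandwidth_gib_per_s(bytes_moved, seconds):
    """Convert a byte count and an elapsed time into GiB/s."""
    return bytes_moved / seconds / 2**30


def measure_read_bandwidth(working_set_bytes, total_bytes=1 << 24):
    """Scan a buffer repeatedly and report the achieved read bandwidth.

    The buffer is re-read until roughly total_bytes have been touched,
    so small working sets are read many times and large ones few times.
    """
    buf = bytearray(working_set_bytes)
    passes = max(1, total_bytes // working_set_bytes)
    start = time.perf_counter()
    for _ in range(passes):
        # count() scans every byte of the buffer, which forces a full read.
        buf.count(1)
    # Guard against a zero interval on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return bandwidth_gib_per_s(passes * working_set_bytes, elapsed)


if __name__ == "__main__":
    for size in working_set_sizes(4 * 1024, 16 * 1024 * 1024):
        print(f"{size // 1024:>6} KiB  {measure_read_bandwidth(size):8.2f} GiB/s")
```

Plotting bandwidth against working-set size is what typically exposes the cache levels: each drop in the curve suggests the data no longer fits in the cache that was serving it. A real GPU version would also vary the resource type (buffer or texture) and the element size, as the summary notes.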
