Learning about GPUs through measuring memory bandwidth
4 days ago
- #Microbenchmark
- #GPU
- #Memory Bandwidth
- Traverse Research uses microbenchmarks to measure GPU memory bandwidth for insights into hardware performance.
- GPU memory access is complex, involving descriptors, different buffer types (Byte Address, Structured, Typed), and texture units.
- Memory hierarchy in GPUs includes multiple cache levels (L0, L1, L2, etc.) to balance size and performance.
- Caches use write-back, write-through, or write-around policies; write-back is common because it lets writes be combined in the cache before they reach the next level.
- GPUs hide latency by keeping more threads in flight, but excessive threads can cause cache thrashing.
- Microbenchmark design involves reading or writing data in loops, avoiding unintended cache hits so that the intended memory level is measured, and testing different element sizes.
- Findings include performance differences between textures and buffers, and bottlenecks with specific data types.
- Qualcomm Adreno 740 (Meta Quest 3) shows significant bandwidth differences between buffers and textures in main memory.
- AMD Radeon RX 9070 XT has fast L0 cache (20 TiB/s for buffers, 11 TiB/s for textures) and ALU bottlenecks with integer operations.
- Intel Arc B580 shows varying bandwidth for different data types when loading from textures vs buffers.
- NVIDIA GeForce RTX 5070 Ti bottlenecks when many writes target the same memory, and its uint1 loads perform differently from its float1 loads.
- Microbenchmarks reveal hardware-specific optimizations and mysteries, guiding performance tuning for specific GPUs.
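
The arithmetic behind a bandwidth microbenchmark of this kind can be sketched as follows. This is a minimal illustration in Python, not the shader code used for the measurements: the function names, the power-of-two working-set sweep, and the modulo wrap-around addressing are all assumptions made for the example. The idea is that each thread issues a fixed number of loads, addresses wrap around a chosen working-set size so that the data fits in (or deliberately exceeds) a given cache level, and bandwidth is total bytes moved divided by elapsed time.

```python
TIB = 1024 ** 4  # bytes per tebibyte


def bandwidth_tib_per_s(threads, loads_per_thread, bytes_per_load, elapsed_s):
    """Return achieved bandwidth in TiB/s for one benchmark run.

    All parameters are hypothetical inputs: the thread count, the number of
    loads each thread issues in its loop, the element size in bytes, and the
    measured GPU time in seconds.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_bytes = threads * loads_per_thread * bytes_per_load
    return total_bytes / elapsed_s / TIB


def working_set_sizes(min_bytes, max_bytes):
    """Return power-of-two working-set sizes from min_bytes up to max_bytes.

    Sweeping these sizes is one way (assumed here) to find where bandwidth
    drops, which hints at the capacity of each cache level.
    """
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def wrapped_offset(thread_id, iteration, threads, bytes_per_load, working_set_bytes):
    """Return the byte offset a thread would read on a given loop iteration.

    Threads read consecutive elements, and the address wraps around the
    working set so the footprint never exceeds working_set_bytes.
    """
    linear = (iteration * threads + thread_id) * bytes_per_load
    return linear % working_set_bytes
```

For example, `working_set_sizes(1024, 8192)` yields four sizes to sweep, and the size at which `bandwidth_tib_per_s` falls off would suggest that the working set no longer fits in the cache level being measured. Real cache capacities and timings differ per GPU and have to be measured on the device.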