How Much Linear Memory Access Is Enough?
2 days ago
- #Memory Performance
- #Block Size Optimization
- #CPU Cache Effects
- Linear memory access performance depends on block size, with diminishing returns beyond certain thresholds.
- 1 MB blocks are generally sufficient for peak performance; 128 kB is enough when the workload costs at least ~1 cycle per byte, and 4 kB is enough at ~10+ cycles per byte.
- Experiments used kernels like scalar_stats (light processing), simd_sum (fast SIMD), and heavy_sin (heavy computation) to test block sizes.
- Cache clobbering and randomized layouts were used to simulate cold cache scenarios, while repeated runs modeled pre-warmed caches.
- Results show that the block size needed for peak performance depends on workload intensity: the more computation per byte, the smaller the block that already reaches peak performance.
- The findings apply to linear per-block processing; striding, allocations, or other per-block costs may shift the curves.
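The shape of such an experiment can be sketched in a few lines. This is a minimal illustration, not the article's benchmark: the function names (`make_block_order`, `sum_blocks`, `measure`) are hypothetical, a plain byte sum stands in for the real kernels, and shuffling the visit order over one contiguous buffer only approximates a randomized layout. There is no cache clobbering, and Python's interpreter overhead dominates the timings, so the printed numbers are not comparable to cycles-per-byte figures from native code.

```python
import random
import time


def make_block_order(total_bytes, block_size, seed=0):
    """Return block start offsets in shuffled order (stand-in for a randomized layout)."""
    if block_size <= 0 or total_bytes % block_size != 0:
        raise ValueError("block_size must be positive and divide total_bytes")
    offsets = list(range(0, total_bytes, block_size))
    random.Random(seed).shuffle(offsets)
    return offsets


def sum_blocks(data, offsets, block_size):
    """Process each block linearly; blocks are visited in the given order."""
    view = memoryview(data)
    total = 0
    for start in offsets:
        # Within a block the access is strictly linear.
        total += sum(view[start:start + block_size])
    return total


def measure(total_bytes, block_size, seed=0):
    """Return (checksum, nanoseconds per byte) for one block size."""
    data = bytearray(i & 0xFF for i in range(total_bytes))
    offsets = make_block_order(total_bytes, block_size, seed)
    t0 = time.perf_counter_ns()
    checksum = sum_blocks(data, offsets, block_size)
    elapsed = time.perf_counter_ns() - t0
    return checksum, elapsed / total_bytes


if __name__ == "__main__":
    total = 1 << 20  # 1 MiB of data in total
    for block_size in (1 << 12, 1 << 17, 1 << 20):  # 4 kB, 128 kB, 1 MB
        checksum, ns_per_byte = measure(total, block_size)
        print(f"{block_size:>8} B blocks: {ns_per_byte:.2f} ns/byte")
```

The checksum is the same for every block size, which gives a quick check that changing the block size alters only the access pattern and not the work done.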