Why Your CPU Is Fast but Your Program Is Slow: Understanding the Memory Wall
4 days ago
- #cache-hierarchy
- #memory-wall
- #cpu-performance
- A CPU executes billions of operations per second, but actual program speed often depends on memory access patterns.
- The 'Memory Wall' refers to the performance gap between fast CPUs and slower DRAM, with access latencies differing by up to 100x.
- DRAM stores data in capacitors arranged in rows; an access must first activate the target row, and switching to a different row requires a precharge beforehand, so random access patterns that keep changing rows are slow.
- The cache hierarchy (L1, L2, L3) mitigates memory latency by storing frequently used data closer to the CPU, with L1 being the fastest and smallest level.
- Stride scan experiments show performance dropping sharply at a stride of 64 bytes, the typical cache line size: each access then lands on a different cache line, so every access becomes a cache miss and most of each fetched line goes unused.
- Programs can be memory-bound (limited by data access speed) or compute-bound (limited by CPU computation); many are actually memory-bound despite high CPU utilization, because a core stalled waiting on memory is still reported as busy.
- Optimizing memory access patterns (e.g., sequential access) is crucial for performance, as changing data movement often matters more than making the CPU faster.
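
The stride-scan point above can be made concrete with a small back-of-the-envelope model. This is only a sketch, not a benchmark: it counts how many cache lines a single pass touches rather than measuring time. The 64-byte line size, the 4-byte element size, the line-aligned buffer, the absence of prefetching, and the helper name `stride_scan_stats` are all assumptions made for illustration.

```python
# Toy model of a stride scan: counts cache lines touched, no real timing.
# Assumptions (not from the article): 64-byte lines, 4-byte elements,
# a buffer that starts on a line boundary, and no hardware prefetching.

LINE_BYTES = 64   # assumed cache line size
ELEM_BYTES = 4    # assumed element size (e.g. a 32-bit integer)


def stride_scan_stats(buffer_bytes, stride_bytes):
    """Return (accesses, lines_touched, misses_per_access) for one cold pass."""
    offsets = range(0, buffer_bytes, stride_bytes)
    accesses = len(offsets)
    # On a cold pass, each distinct line index is fetched from memory once.
    lines_touched = len({off // LINE_BYTES for off in offsets})
    return accesses, lines_touched, lines_touched / accesses


if __name__ == "__main__":
    for stride in (4, 8, 16, 32, 64, 128):
        accesses, lines, ratio = stride_scan_stats(1 << 20, stride)
        # Fraction of the fetched bytes that the scan actually reads.
        used = ELEM_BYTES * accesses / (lines * LINE_BYTES)
        print(f"stride={stride:4d}B  accesses={accesses:7d}  lines={lines:6d}  "
              f"misses/access={ratio:.3f}  fetched bytes used={used:.1%}")
```

Under these assumptions, a 4-byte stride pays one line fetch per 16 accesses and uses every fetched byte, while a stride of 64 bytes or more pays one fetch per access and uses only 4 of every 64 bytes fetched. Real hardware will deviate from this model, since prefetchers and the DRAM row behaviour described above also affect the measured curve.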