Pipelining and prefetching: a 45% speedup story
10 days ago
- #memory-access
- #performance-optimization
- #rubiks-cube-solver
- The blog post discusses a performance optimization technique involving pipelining and prefetching to speed up memory access in a Rubik's cube solver.
- The main computational step is a depth-first search on an implicit graph, using a heuristic with data from large pruning tables to skip unnecessary branches.
- Random memory access is slow, but prefetching allows the CPU to fetch data in advance while performing other operations, improving performance.
- The optimization involves expanding multiple neighbor nodes 'in parallel' to give the CPU time to prefetch data, resulting in a 33% to 45% speedup.
- The solver uses IDA* (iterative-deepening A*): a depth-first search rerun with increasing depth limits, where an admissible heuristic bounds how deep each branch is worth exploring.
- Pruning tables provide lower bounds for the number of moves needed to solve a cube configuration, helping to skip unpromising branches early.
- The old version read each pruning-table value immediately after computing its index, so every dependent load stalled on a cache miss. The new version computes the indices for all neighbors first, giving the CPU time to fetch the corresponding cache lines before the values are needed.
- Manual prefetching instructions (e.g., `__builtin_prefetch`) provided an additional 10% to 20% speedup on top of the pipelining optimization.
- Benchmarks show significant performance improvements across different pruning table sizes, with larger tables benefiting less due to fewer lookups.
- The technique could be extended further by prefetching data for nodes deeper in the search tree, though this would require breaking the recursive DFS structure.
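The pruning mechanism described above can be sketched as follows. This is a minimal illustration, not the solver's actual code: `State`, `apply_move`, `prune_index`, and the 18-move count are hypothetical stand-ins for the real cube coordinates and move table. The key line is the lower-bound check that cuts off a branch early.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

constexpr int kNumMoves = 18;  // a Rubik's cube solver typically has 18 face moves

struct State { uint64_t coord; };  // placeholder for the real cube coordinate

// Stand-in for applying a move to a cube state (just mixes the bits here).
State apply_move(State s, int m) {
    s.coord = s.coord * 6364136223846793005ULL + static_cast<uint64_t>(m);
    return s;
}

// Stand-in for mapping a state to its pruning-table slot.
size_t prune_index(State s, size_t table_size) { return s.coord % table_size; }

// IDA*-style depth-first search. The pruning table stores a lower bound on
// the moves still needed from a state; if that bound exceeds the remaining
// depth budget, no solution can exist down this branch, so we skip it.
bool search(State s, int depth_left, const std::vector<uint8_t>& table) {
    if (depth_left == 0) return s.coord == 0;  // stand-in for "solved"
    uint8_t lower_bound = table[prune_index(s, table.size())];
    if (lower_bound > depth_left) return false;  // prune unpromising branch
    for (int m = 0; m < kNumMoves; ++m)
        if (search(apply_move(s, m), depth_left - 1, table))
            return true;
    return false;
}
```

Note that the table lookup happens once per visited node, which is why its memory-access latency dominates and why the pipelining trick in the post pays off.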
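The index-first, load-later restructuring can be sketched like this. The function name, the fixed 18-entry arrays, and the prefetch parameters are illustrative assumptions, not the post's actual code; the point is the two-phase shape: issue all prefetch hints, then do the actual reads so they overlap with the in-flight fetches.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Old style (for contrast): compute one neighbor's index, immediately load
// table[index], stall on the cache miss, repeat 18 times.
//
// New style (below): compute all indices up front, hint the CPU with
// __builtin_prefetch (a GCC/Clang builtin), then read the values. The reads
// now overlap with memory fetches that are already in flight.
void lookup_pipelined(const std::vector<uint8_t>& table,
                      const size_t (&indices)[18],
                      uint8_t (&values)[18]) {
    // Phase 1: start fetching every cache line we will need.
    // Arguments: address, rw=0 (read), locality=1 (low temporal reuse).
    for (size_t i = 0; i < 18; ++i)
        __builtin_prefetch(&table[indices[i]], 0, 1);
    // Phase 2: the dependent loads, by now hopefully cache hits.
    for (size_t i = 0; i < 18; ++i)
        values[i] = table[indices[i]];
}
```

`__builtin_prefetch` is only a hint; correctness is unaffected if the CPU ignores it, which is what makes this a safe optimization to layer on top of the index-batching change.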