Pipelining and prefetching: a 45% speedup story
10 days ago
- #memory-access
- #performance-optimization
- #rubiks-cube-solver
- The blog post discusses a performance optimization technique involving pipelining and prefetching to speed up memory access in a Rubik's cube solver.
- The main computational step is a depth-first search on an implicit graph, using a heuristic with data from large pruning tables to skip unnecessary branches.
- Random memory access is slow, but prefetching allows the CPU to fetch data in advance while performing other operations, improving performance.
- The optimization involves expanding multiple neighbor nodes 'in parallel' to give the CPU time to prefetch data, resulting in a 33% to 45% speedup.
- The solver uses IDA* (iterative-deepening A*): a depth-first search rerun with increasing depth limits, where an admissible heuristic bounds how deep each branch is worth exploring.
- Pruning tables provide lower bounds for the number of moves needed to solve a cube configuration, helping to skip unpromising branches early.
- The old version read each pruning-table value immediately after computing its index, so every dependent load stalled on a cache miss. The new version computes the indices for all neighbors first, giving the CPU time to fetch the corresponding cache lines before the values are needed.
- Manual prefetching instructions (e.g., `__builtin_prefetch`) provided an additional 10% to 20% speedup on top of the pipelining optimization.
- Benchmarks show significant performance improvements across different pruning table sizes, with larger tables benefiting less due to fewer lookups.
- The technique could be extended further by prefetching data for nodes deeper in the search tree, though this would require breaking the recursive DFS structure.
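The pruning mechanism described above can be sketched as follows. This is a minimal illustration, not the solver's actual code: `State`, `apply_move`, `prune_index`, and the 18-move count are hypothetical stand-ins for the real cube coordinates and move table. The key line is the lower-bound check that cuts off a branch early.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

constexpr int kNumMoves = 18;  // a Rubik's cube solver typically has 18 face moves

struct State { uint64_t coord; };  // placeholder for the real cube coordinate

// Stand-in for applying a move to a cube state (just mixes the bits here).
State apply_move(State s, int m) {
    s.coord = s.coord * 6364136223846793005ULL + static_cast<uint64_t>(m);
    return s;
}

// Stand-in for mapping a state to its pruning-table slot.
size_t prune_index(State s, size_t table_size) { return s.coord % table_size; }

// IDA*-style depth-first search. The pruning table stores a lower bound on
// the moves still needed from a state; if that bound exceeds the remaining
// depth budget, no solution can exist down this branch, so we skip it.
bool search(State s, int depth_left, const std::vector<uint8_t>& table) {
    if (depth_left == 0) return s.coord == 0;  // stand-in for "solved"
    uint8_t lower_bound = table[prune_index(s, table.size())];
    if (lower_bound > depth_left) return false;  // prune unpromising branch
    for (int m = 0; m < kNumMoves; ++m)
        if (search(apply_move(s, m), depth_left - 1, table))
            return true;
    return false;
}
```

Note that the table lookup happens once per visited node, which is why its memory-access latency dominates and why the pipelining trick in the post pays off.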
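The index-first, load-later restructuring can be sketched like this. The function name, the fixed 18-entry arrays, and the prefetch parameters are illustrative assumptions, not the post's actual code; the point is the two-phase shape: issue all prefetch hints, then do the actual reads so they overlap with the in-flight fetches.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Old style (for contrast): compute one neighbor's index, immediately load
// table[index], stall on the cache miss, repeat 18 times.
//
// New style (below): compute all indices up front, hint the CPU with
// __builtin_prefetch (a GCC/Clang builtin), then read the values. The reads
// now overlap with memory fetches that are already in flight.
void lookup_pipelined(const std::vector<uint8_t>& table,
                      const size_t (&indices)[18],
                      uint8_t (&values)[18]) {
    // Phase 1: start fetching every cache line we will need.
    // Arguments: address, rw=0 (read), locality=1 (low temporal reuse).
    for (size_t i = 0; i < 18; ++i)
        __builtin_prefetch(&table[indices[i]], 0, 1);
    // Phase 2: the dependent loads, by now hopefully cache hits.
    for (size_t i = 0; i < 18; ++i)
        values[i] = table[indices[i]];
}
```

`__builtin_prefetch` is only a hint; correctness is unaffected if the CPU ignores it, which is what makes this a safe optimization to layer on top of the index-batching change.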