Hasty Briefsbeta

Pipelining and prefetching: a 45% speedup story

10 days ago
  • #memory-access
  • #performance-optimization
  • #rubiks-cube-solver
  • The blog post discusses a performance optimization technique involving pipelining and prefetching to speed up memory access in a Rubik's cube solver.
  • The main computational step is a depth-first search on an implicit graph, using a heuristic with data from large pruning tables to skip unnecessary branches.
  • Random memory access is slow, but prefetching allows the CPU to fetch data in advance while performing other operations, improving performance.
  • The optimization involves expanding multiple neighbor nodes 'in parallel' to give the CPU time to prefetch data, resulting in a 33% to 45% speedup.
  • The solver uses an iterative-deepening depth-first search (IDDFS) combined with IDA* (iterative-deepening A*) algorithm for efficiency.
  • Pruning tables provide lower bounds for the number of moves needed to solve a cube configuration, helping to skip unpromising branches early.
  • The old version accessed pruning table values immediately after computing indices, causing stalls. The new version computes indices for all neighbors first, allowing prefetching.
  • Manual prefetching instructions (e.g., `__builtin_prefetch`) provided an additional 10% to 20% speedup on top of the pipelining optimization.
  • Benchmarks show significant performance improvements across different pruning table sizes, with larger tables benefiting less due to fewer lookups.
  • The technique could be extended further by prefetching data for nodes deeper in the search tree, though this would require breaking the recursive DFS structure.