Anukari on the CPU (part 2: CPU optimization)
7 days ago
- #Performance Tuning
- #CPU Optimization
- #SIMD
- The initial CPU implementation of Anukari was only about 5x slower than the GPU version, far better than the expected 100x gap.
- The first optimization spot-vectorized float3 operations with SIMD intrinsics, which improved speed but still fell short of GPU performance.
- The second approach restructured the data so the compiler could auto-vectorize, bringing CPU performance roughly in line with the GPU.
- The third approach switched to manual vectorization with intrinsics to work around compiler limitations, pushing the CPU ahead of the GPU.
- The final optimization used bespoke single-pass intrinsics, eliminating multi-pass overhead and exploiting instruction pipelining for further large speedups.
- Comparing Structure of Arrays (SoA) and Array of Structs (AoS) memory layouts showed AoS to be faster despite wasting a SIMD lane, thanks to better cache utilization.
- Prefetching experiments yielded no measurable improvement, possibly because the hot loops were already pipelining memory accesses efficiently.
- The final post in the series will cover lessons learned and reflections on the optimization process.