Anukari on the CPU (part 2: CPU optimization)
7 days ago
- #Performance Tuning
- #CPU Optimization
- #SIMD
- The initial CPU implementation of Anukari was only about 5x slower than the GPU version, far better than the expected 100x gap.
- The first optimization spot-vectorized float3 operations with SIMD intrinsics, which improved speed but still fell short of GPU performance.
- The second approach restructured the data so the compiler could auto-vectorize, bringing CPU performance roughly in line with the GPU.
- The third approach switched to manual vectorization with intrinsics to work around compiler limitations, pushing the CPU ahead of the GPU.
- The final optimization used bespoke single-pass intrinsics, eliminating multi-pass overhead and exploiting instruction pipelining for further large speedups.
- Comparing Structure of Arrays (SoA) and Array of Structs (AoS) memory layouts showed AoS to be faster despite wasting a SIMD lane, thanks to better cache utilization.
- Prefetching experiments yielded no measurable improvement, possibly because the hot loops were already pipelining memory accesses efficiently.
- The final post in the series will cover lessons learned and reflections on the optimization process.