When O3 is 2x slower than O2
6 months ago
- #Optimization
- #Rust
- #Performance
- The article discusses a performance regression when optimizing a custom bounded priority queue in Rust, specifically when changing the optimization level from `opt-level=2` to `opt-level=3`.
- Benchmark results show a +123% performance penalty with `opt-level=3` compared to `opt-level=2`.
- The queue uses a sorted `Vec` with binary search for insertion, and the performance issue is traced back to the compare function's behavior under different optimization levels.
- Flamegraph analysis reveals that `binary_search_by` and the compare function consume significantly more CPU time with `opt-level=3`.
- Assembly code inspection shows that `opt-level=3` replaces conditional jumps with conditional moves, which theoretically should be faster but in practice leads to worse performance due to dependency bottlenecks.
- Alternative compare functions, such as using `f32::total_cmp`, are tested but still result in slower performance with `opt-level=3`.
- Experiments with modifying `binary_search_by` (e.g., adding `hint::black_box` or replacing `hint::select_unpredictable`) yield mixed results, sometimes improving performance but not consistently.
- Historical context from Rust's issue tracker indicates that conditional moves were introduced to improve performance in certain cases, but they can regress performance in others.
- The article concludes that benchmarking is complex and performance outcomes can vary widely depending on the specific use case and optimization strategies.