FastTwoSum Is Faster Than TwoSum
- #algorithms
- #floating-point
- #performance
- Floating-point addition rounds its result, but error-free transformations like TwoSum and FastTwoSum quantify the imprecision by computing both the rounded sum and the exact rounding error.
- TwoSum uses six floating-point operations with a latency of 15–20 cycles; FastTwoSum needs only three operations with a latency of 9–12 cycles, but requires its inputs to be ordered by decreasing absolute value.
- If the inputs are not already sorted, the branch needed to order them can make FastTwoSum slower than TwoSum through misprediction penalties, but branchless conditional-select operations can mitigate this.
- Performance analysis on Apple M1 shows TwoSum at 15 cycles and FastTwoSum at 9 cycles, with sorted versions in between.
- On Intel Coffee Lake, the basic-sort variant of FastTwoSum performs exceptionally well when compiled with GCC, but Clang's performance varies because it sometimes emits branches for the sort.
- SIMD implementations did not significantly improve performance over branchless conditional moves.
- The best FastTwoSum variant is faster than TwoSum on both ARM and Intel, with speedups of 20% or more, but requires careful compiler optimization to avoid branches.
- A hardware implementation of TwoSum or equivalent could significantly reduce the cost of double-double addition.