FastTwoSum Is Faster Than TwoSum
- #algorithms
- #floating-point
- #performance
- Floating-point addition rounds its result, but error-free transformations like TwoSum and FastTwoSum quantify the imprecision by computing both the rounded sum and the exact rounding error.
- TwoSum uses six floating-point operations with a latency of 15–20 cycles; FastTwoSum needs only three operations with a latency of 9–12 cycles, but requires its inputs to be ordered by decreasing absolute value.
- If the inputs are not already sorted, the branch needed to order them can make FastTwoSum slower than TwoSum through misprediction penalties, but branchless conditional-select operations can mitigate this.
- Performance analysis on Apple M1 shows TwoSum at 15 cycles and FastTwoSum at 9 cycles, with sorted versions in between.
- On Intel Coffee Lake, the basic-sort variant of FastTwoSum performs exceptionally well when compiled with GCC, but Clang's performance varies because it sometimes emits branches for the sort.
- SIMD implementations did not significantly improve performance over branchless conditional moves.
- The best FastTwoSum variant is faster than TwoSum on both ARM and Intel, with speedups of 20% or more, but requires careful compiler optimization to avoid branches.
- A hardware implementation of TwoSum or equivalent could significantly reduce the cost of double-double addition.