N times faster than C, Arm edition (2023)
a day ago
- #simd
- #assembly
- #optimization
- The blog post discusses optimizing a simple algorithm for counting 's' and 'p' characters in a byte array, initially written in C and then optimized using A64 assembly and Neon SIMD instructions.
- Performance improvements are achieved through various strategies, including table-driven approaches, reducing load operations, and utilizing 128-bit integers for processing multiple bytes at once.
- The use of Neon SIMD instructions significantly boosts performance, with techniques like vector accumulation and minimizing reduction operations leading to a 73× speedup over the basic implementation.
- Further optimizations include assuming input consists only of 's' and 'p' characters to simplify checks, and reformulating the problem to leverage bitwise operations for counting.
- A comparison with a highly optimized C implementation shows that compiler optimizations can sometimes outperform hand-written SIMD code, emphasizing the importance of writing compiler-friendly code.
- Key lessons include the value of reformulating code for compiler optimization, checking generated assembly, and the potential need for raw assembly in cases where the compiler doesn't produce optimal code.