Hasty Briefsbeta

Bilingual

N times faster than C, Arm edition (2023)

a day ago
  • #simd
  • #assembly
  • #optimization
  • The blog post discusses optimizing a simple algorithm for counting 's' and 'p' characters in a byte array, initially written in C and then optimized using A64 assembly and Neon SIMD instructions.
  • Performance improvements are achieved through various strategies, including table-driven approaches, reducing load operations, and utilizing 128-bit integers for processing multiple bytes at once.
  • The use of Neon SIMD instructions significantly boosts performance, with techniques like vector accumulation and minimizing reduction operations leading to a 73× speedup over the basic implementation.
  • Further optimizations include assuming input consists only of 's' and 'p' characters to simplify checks, and reformulating the problem to leverage bitwise operations for counting.
  • A comparison with a highly optimized C implementation shows that compiler optimizations can sometimes outperform hand-written SIMD code, emphasizing the importance of writing compiler-friendly code.
  • Key lessons include the value of reformulating code for compiler optimization, checking generated assembly, and the potential need for raw assembly in cases where the compiler doesn't produce optimal code.