Coding Neon Kernels for the Cortex-A53
a year ago
- #Optimization
- #NEON
- #Cortex-A53
- The post details the optimization of a NEON assembly kernel for the Cortex-A53, focusing on the operation y[n] = ax[n] + b.
- Cortex-A53 lacks public documentation on instruction timing, making optimization reliant on folklore and micro-benchmarks.
- The Cortex-A53 is an in-order CPU with partial dual-issue capabilities, making instruction timing predictable but sensitive to code order.
- NEON instructions on Cortex-A53 can dual-issue under certain conditions, particularly when operating on 64-bit halves of independent NEON registers.
- The load data path to L1d cache is 64 bits, while the store data path is 128 bits, influencing how data movement is optimized.
- The theoretical maximum for this kernel is one y[n] result per clock cycle, which is 2 FLOPs/cycle (one multiply and one add per element).
- Optimization strategies include balancing the 64-bit load path against the 128-bit arithmetic and store paths, and software pipelining to hide result latency.
- Prefetching with prfm instructions is used to minimize L1d cache misses, improving performance without additional cycle cost.
- Hand-written assembly significantly outperforms LLVM-generated code, which fails to optimize for Cortex-A53's specific characteristics.
- The complete kernel implementation includes a prologue, a loop, and an epilogue, with prefetching, and achieves near the theoretical maximum performance.
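
For orientation, here is a minimal C sketch of the operation the kernel computes, y[n] = a*x[n] + b. It is not the post's hand-written assembly and makes no attempt at Cortex-A53 instruction scheduling, software pipelining, or prefetching. The function names `axpb_ref` and `axpb_vec4` are hypothetical, chosen only for this illustration. The second function processes four floats (one 128-bit NEON register) per step using standard `arm_neon.h` intrinsics when built for AArch64, and falls back to plain scalar code elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference: y[n] = a*x[n] + b, i.e. 2 FLOPs per element. */
void axpb_ref(float *y, const float *x, float a, float b, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + b;
}

/* Hypothetical vectorized variant (illustration only): handles four
 * floats (128 bits) per iteration on AArch64, then finishes any
 * remaining elements with the scalar loop. The compiler, not this
 * code, decides the final instruction order, so this does not
 * reproduce a hand-scheduled kernel. */
void axpb_vec4(float *y, const float *x, float a, float b, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    const float32x4_t va = vdupq_n_f32(a);   /* broadcast a to 4 lanes */
    const float32x4_t vb = vdupq_n_f32(b);   /* broadcast b to 4 lanes */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);   /* load 4 inputs */
        /* vfmaq_f32(acc, p, q) computes acc + p*q per lane. */
        vst1q_f32(y + i, vfmaq_f32(vb, va, vx));
    }
#endif
    /* Scalar tail; also the whole loop on non-AArch64 targets. */
    for (; i < n; i++)
        y[i] = a * x[i] + b;
}
```

A scalar reference like `axpb_ref` is mainly useful as a correctness check when comparing against a tuned kernel. Note that the fused multiply-add in the NEON path rounds once, so for general inputs its results can differ from the scalar version in the last bit.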