Coding Neon Kernels for the Cortex-A53

a year ago

The post details the optimization of a NEON assembly kernel for the Cortex-A53 that computes y[n] = a*x[n] + b for each element of an array.
The Cortex-A53 lacks public documentation on instruction timing, so optimization relies on folklore and micro-benchmarks.
The Cortex-A53 is an in-order CPU with partial dual-issue capabilities, making instruction timing predictable but sensitive to code order.
NEON instructions on Cortex-A53 can dual-issue under certain conditions, particularly when operating on 64-bit halves of independent NEON registers.
The load data path to L1d cache is 64 bits, while the store data path is 128 bits, influencing how data movement is optimized.
The theoretical maximum for the kernel is one y[n] result per clock cycle, which is 2 FLOPs/cycle (one multiply and one add per element).
Optimization strategies include matching 64-bit load paths to 128-bit arithmetic and store paths, and pipelining to hide result latency.
Prefetching with prfm instructions is used to minimize L1d cache misses, improving performance without additional cycle cost.
Hand-written assembly significantly outperforms LLVM-generated code, which fails to optimize for Cortex-A53's specific characteristics.
The complete kernel implementation includes a prologue, loop, and epilogue, with prefetching, achieving near theoretical maximum performance.
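
The prologue, loop, and epilogue structure with prefetching can be sketched in plain C. This is only an illustration of the general shape, not the post's NEON assembly: the function names `axpb_ref` and `axpb_pipelined`, the block size of four floats, and the `PREFETCH_AHEAD` distance are all assumptions made for this sketch. `__builtin_prefetch` is a GCC/Clang builtin that typically lowers to a prfm instruction on AArch64, and a compiler is free to reorder the loads and stores shown here, which is part of why the post resorts to hand-written assembly.

```c
#include <stddef.h>

/* Scalar reference: y[n] = a * x[n] + b. */
void axpb_ref(float *y, const float *x, float a, float b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + b;
}

/* Hypothetical prefetch distance in elements; a real kernel would tune
   this by micro-benchmark. */
#define PREFETCH_AHEAD 16

/* Software-pipelined variant: the loads for the next block of four are
   issued before the arithmetic and stores of the current block, so the
   load latency overlaps with useful work. */
void axpb_pipelined(float *y, const float *x, float a, float b, size_t n)
{
    size_t i = 0;

    if (n >= 4) {
        /* Prologue: load the first block of four inputs. */
        float x0 = x[0], x1 = x[1], x2 = x[2], x3 = x[3];

        /* Loop: runs while another full block remains to be loaded. */
        for (; i + 8 <= n; i += 4) {
            /* Prefetch hints for data used in a later iteration
               (guarded so the address stays inside the arrays). */
            if (i + PREFETCH_AHEAD < n) {
                __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 3);
                __builtin_prefetch(&y[i + PREFETCH_AHEAD], 1, 3);
            }

            /* Load the next block before computing the current one. */
            float n0 = x[i + 4], n1 = x[i + 5];
            float n2 = x[i + 6], n3 = x[i + 7];

            /* Compute and store the current block. */
            y[i + 0] = a * x0 + b;
            y[i + 1] = a * x1 + b;
            y[i + 2] = a * x2 + b;
            y[i + 3] = a * x3 + b;

            /* The next block becomes the current block. */
            x0 = n0; x1 = n1; x2 = n2; x3 = n3;
        }

        /* Epilogue: drain the last block that was loaded. */
        y[i + 0] = a * x0 + b;
        y[i + 1] = a * x1 + b;
        y[i + 2] = a * x2 + b;
        y[i + 3] = a * x3 + b;
        i += 4;
    }

    /* Tail: fewer than four elements left over. */
    for (; i < n; i++)
        y[i] = a * x[i] + b;
}
```

Both functions should produce identical output for any length n; the pipelined version differs only in the order in which loads, arithmetic, and stores are written.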
