Demystifying ARM SME to Optimize General Matrix Multiplications

5 days ago


The paper introduces MpGEMM, an open-source library optimized for General Matrix Multiplication (GEMM) on ARM's Scalable Matrix Extension (SME).
MpGEMM leverages cache-aware partitioning, efficient data packing, and specialized micro-kernels to maximize performance.
The library achieves a 1.23x speedup over Apple's Accelerate library and outperforms other open-source alternatives in real-world workloads.
Its optimization techniques include on-the-fly transposition and the use of SME's multi-vector loads and tile registers.
The library is evaluated on an Apple M4 Pro using workloads from DeepSeek and LLaMA.
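
To make the three ingredients named above more concrete (cache-aware partitioning, data packing, and a micro-kernel), here is a minimal pure-Python sketch of a blocked GEMM. It is not MpGEMM's code or API: the function names (`pack_a`, `pack_b`, `micro_kernel`, `gemm_blocked`) and the block sizes `MC`, `NC`, `KC` are hypothetical and chosen only for illustration. The micro-kernel is written as a sum of rank-1 (outer-product) updates into a local tile, which loosely mirrors how SME's outer-product instructions accumulate into tile storage. A real kernel would use SME instructions or intrinsics rather than Python loops.

```python
# Illustrative sketch only: names and block sizes are assumptions,
# not taken from MpGEMM.

def pack_a(A, i0, k0, mc, kc):
    """Copy an mc x kc block of A into a panel stored as a list of columns,
    so the micro-kernel reads one contiguous column per k step."""
    return [[A[i0 + i][k0 + k] for i in range(mc)] for k in range(kc)]

def pack_b(B, k0, j0, kc, nc):
    """Copy a kc x nc block of B into a panel stored as a list of rows."""
    return [B[k0 + k][j0:j0 + nc] for k in range(kc)]

def micro_kernel(a_panel, b_panel, mc, nc):
    """Accumulate kc rank-1 (outer-product) updates into a local mc x nc tile."""
    tile = [[0.0] * nc for _ in range(mc)]
    for a_col, b_row in zip(a_panel, b_panel):
        for i in range(mc):
            a = a_col[i]
            row = tile[i]
            for j in range(nc):
                row[j] += a * b_row[j]
    return tile

def gemm_blocked(A, B, MC=4, NC=4, KC=4):
    """Compute C = A * B by partitioning into MC x KC and KC x NC blocks."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for j0 in range(0, N, NC):
        nc = min(NC, N - j0)
        for k0 in range(0, K, KC):
            kc = min(KC, K - k0)
            b_panel = pack_b(B, k0, j0, kc, nc)   # packed once, reused for every row block
            for i0 in range(0, M, MC):
                mc = min(MC, M - i0)
                a_panel = pack_a(A, i0, k0, mc, kc)
                tile = micro_kernel(a_panel, b_panel, mc, nc)
                for i in range(mc):               # write the tile back into C
                    for j in range(nc):
                        C[i0 + i][j0 + j] += tile[i][j]
    return C
```

In this sketch the block sizes stand in for values that a tuned library would derive from the cache hierarchy, and `pack_a` reorders a row-major block into columns as it copies, which is one simple way to fold a layout change into the packing step. Whether MpGEMM structures its loops or packing in this way is not stated in the summary.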
