Demystifying ARM SME to Optimize General Matrix Multiplications
5 days ago
- #ARM SME
- #High-Performance Computing
- #Matrix Multiplication
- The paper introduces MpGEMM, an open-source library optimized for General Matrix Multiplication (GEMM) on ARM's Scalable Matrix Extension (SME).
- MpGEMM leverages cache-aware partitioning, efficient data packing, and specialized micro-kernels to maximize performance.
- The library achieves an average 1.23x speedup over Apple's Accelerate library and outperforms other open-source alternatives on real-world workloads.
- Optimization techniques include on-the-fly transposition during data packing and micro-kernels that use multi-vector loads and SME tile registers.
- MpGEMM is evaluated on an Apple M4 Pro using workloads from DeepSeek and LLaMA.
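
The partitioning, packing, and micro-kernel ideas summarized above can be illustrated with a minimal pure-Python sketch. It is not MpGEMM's code and uses no SME instructions. The function names and the block sizes `mc`, `nc`, and `kc` are hypothetical, chosen only for illustration. The micro-kernel accumulates a sum of outer products into a small tile, which loosely mirrors how SME tile registers are typically used, and the A block is transposed while it is packed so that each step of the micro-kernel reads one contiguous slice.

```python
def pack_a_transposed(A, i0, mc, p0, kc):
    """Copy the mc x kc block of A at (i0, p0) into kc rows of mc values.

    Transposing during the copy ("on the fly") means step p of the
    micro-kernel reads one contiguous column slice of the A block.
    """
    return [[A[i0 + i][p0 + p] for i in range(mc)] for p in range(kc)]


def pack_b(B, p0, kc, j0, nc):
    """Copy the kc x nc block of B at (p0, j0) into a contiguous panel."""
    return [B[p0 + p][j0:j0 + nc] for p in range(kc)]


def micro_kernel(a_panel, b_panel, acc):
    """Add the sum over p of outer(a_panel[p], b_panel[p]) to the tile acc."""
    for a_col, b_row in zip(a_panel, b_panel):
        for i, a in enumerate(a_col):
            row = acc[i]
            for j, b in enumerate(b_row):
                row[j] += a * b


def blocked_gemm(A, B, mc=4, nc=4, kc=4):
    """Compute C = A @ B for row-major lists of lists, block by block.

    mc, nc, kc are illustrative block sizes; a real library would derive
    them from cache and register-tile capacities.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, nc):              # partition columns of B and C
        nb = min(nc, n - j0)
        for p0 in range(0, k, kc):          # partition the shared dimension
            kb = min(kc, k - p0)
            b_panel = pack_b(B, p0, kb, j0, nb)
            for i0 in range(0, m, mc):      # partition rows of A and C
                mb = min(mc, m - i0)
                a_panel = pack_a_transposed(A, i0, mb, p0, kb)
                acc = [[0.0] * nb for _ in range(mb)]   # stand-in for a tile
                micro_kernel(a_panel, b_panel, acc)
                for i in range(mb):         # write the tile back into C
                    for j in range(nb):
                        C[i0 + i][j0 + j] += acc[i][j]
    return C
```

Calling `blocked_gemm(A, B)` on two compatible matrices returns the same product as a naive triple loop. Only the traversal order and the data layout seen by the inner kernel change, and that is where a cache-aware implementation gains its speed.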