Hasty Briefs (beta)

Demystifying ARM SME to Optimize General Matrix Multiplications

5 days ago
  • #ARM SME
  • #High-Performance Computing
  • #Matrix Multiplication
  • The paper introduces MpGEMM, an open-source library optimized for General Matrix Multiplication (GEMM) on ARM's Scalable Matrix Extension (SME).
  • MpGEMM leverages cache-aware partitioning, efficient data packing, and specialized micro-kernels to maximize performance.
  • The paper reports a 1.23x speedup over Apple's Accelerate library and better performance than other open-source alternatives on real-world workloads.
  • Optimization techniques include on-the-fly transposition and the use of multi-vector loads and tile registers.
  • MpGEMM was evaluated on an Apple M4 Pro with workloads from DeepSeek and LLaMA.
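
The techniques named above (cache-aware partitioning, data packing with on-the-fly transposition, and a micro-kernel that accumulates into a tile) can be illustrated with a minimal pure-Python sketch. This is not MpGEMM's code: the function names, block sizes (`mc`, `nc`, `kc`), and loop order are assumptions chosen for readability. SME's outer-product instructions accumulate into ZA tile registers; here a nested list plays the role of the tile and plain loops stand in for the hardware instructions.

```python
# Illustrative sketch only: names and block sizes are hypothetical,
# not taken from MpGEMM.

def pack_a_transposed(A, i0, k0, mc, kc):
    """Copy an mc x kc block of row-major A into a k-major panel.

    The transposition happens during the copy: packed[p][i] equals
    A[i0 + i][k0 + p], so each step of k reads one contiguous slice.
    """
    return [[A[i0 + i][k0 + p] for i in range(mc)] for p in range(kc)]


def pack_b(B, k0, j0, kc, nc):
    """Copy a kc x nc block of row-major B into a contiguous panel."""
    return [B[k0 + p][j0:j0 + nc] for p in range(kc)]


def micro_kernel(a_panel, b_panel, tile):
    """Accumulate a sum of outer products (one per k step) into the tile."""
    for a_col, b_row in zip(a_panel, b_panel):
        for i, a in enumerate(a_col):
            row = tile[i]
            for j, b in enumerate(b_row):
                row[j] += a * b


def gemm_blocked(A, B, mc=4, nc=4, kc=4):
    """Compute C = A @ B by partitioning into mc x nc x kc blocks."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, nc):
        nb = min(nc, n - j0)
        for k0 in range(0, k, kc):
            kb = min(kc, k - k0)
            # The packed B panel is reused for every row block of A.
            b_panel = pack_b(B, k0, j0, kb, nb)
            for i0 in range(0, m, mc):
                mb = min(mc, m - i0)
                a_panel = pack_a_transposed(A, i0, k0, mb, kb)
                tile = [[0.0] * nb for _ in range(mb)]
                micro_kernel(a_panel, b_panel, tile)
                # Add the finished tile back into C.
                for i in range(mb):
                    for j in range(nb):
                        C[i0 + i][j0 + j] += tile[i][j]
    return C
```

For example, `gemm_blocked([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], mc=1, nc=1, kc=2)` exercises the edge handling for partial blocks and returns the same product as a naive triple loop. In a real library the block sizes would be tuned to the cache hierarchy and the tile dimensions to the hardware vector length.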