Slicing Is All You Need: Towards a Universal One-Sided Distributed MatMul

3 days ago

Introduces a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors.
Uses slicing (index arithmetic) to determine which tiles overlap and must be multiplied together; the resulting list of local matrix multiplies can be executed directly or optimized further.
Implemented in a high-level C++-based PGAS framework enabling direct GPU-to-GPU communication via intra-node interconnects.
The performance evaluation shows the implementation is competitive with PyTorch DTensor, a highly optimized distributed tensor library for AI models.
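
The slicing idea in the summary can be illustrated with a small serial sketch. This is not the paper's implementation: it assumes simple rectangular block grids, ignores replication and communication entirely, and every name in it (`split`, `overlap`, `local_ops`, `matmul_by_slices`) is hypothetical. It only shows how intersecting index ranges yields the list of local multiplies when A, B, and C are partitioned differently.

```python
from itertools import product


def split(extent, parts):
    """Split [0, extent) into `parts` near-equal half-open ranges."""
    base, extra = divmod(extent, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def overlap(a, b):
    """Intersection of two half-open ranges, or None if it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def local_ops(c_tile, a_tiles, b_tiles):
    """Return the (m, k, n) index ranges of each local multiply feeding c_tile.

    Every tile is a pair of half-open ranges: (row_range, col_range).
    """
    c_rows, c_cols = c_tile
    ops = []
    for (a_rows, a_cols), (b_rows, b_cols) in product(a_tiles, b_tiles):
        m = overlap(c_rows, a_rows)   # rows shared by the C tile and the A tile
        n = overlap(c_cols, b_cols)   # columns shared by the C tile and the B tile
        k = overlap(a_cols, b_rows)   # inner dimension shared by the A and B tiles
        if m and n and k:
            ops.append((m, k, n))
    return ops


def tiles(rows, cols, grid):
    """Tile a rows x cols matrix with a (grid_rows, grid_cols) block grid."""
    return list(product(split(rows, grid[0]), split(cols, grid[1])))


def matmul_by_slices(A, B, a_grid, b_grid, c_grid):
    """Serial stand-in: compute C = A @ B by running every sliced local multiply."""
    M, K, N = len(A), len(B), len(B[0])
    a_tiles, b_tiles = tiles(M, K, a_grid), tiles(K, N, b_grid)
    C = [[0] * N for _ in range(M)]
    for c_tile in tiles(M, N, c_grid):
        for (m0, m1), (k0, k1), (n0, n1) in local_ops(c_tile, a_tiles, b_tiles):
            for i in range(m0, m1):
                for j in range(n0, n1):
                    C[i][j] += sum(A[i][k] * B[k][j] for k in range(k0, k1))
    return C
```

In a distributed setting, each `(m, k, n)` triple would correspond to fetching one slice of a remote A tile and one slice of a remote B tile and multiplying them locally; here the loop simply runs the triples in order to check that mismatched grids still produce the right product.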
