Slicing Is All You Need: Towards a Universal One-Sided Distributed MatMul
3 days ago
- #matrix multiplication
- #distributed computing
- #GPU communication
- Introduces a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors.
- Uses slicing (index arithmetic) to determine which tiles overlap and must be multiplied together; the resulting local matrix multiplies can be executed directly or optimized further.
- Implemented in a high-level C++-based PGAS framework enabling direct GPU-to-GPU communication via intra-node interconnects.
- Performance evaluation shows the implementation is competitive with PyTorch DTensor, a highly optimized distributed tensor library for AI models.
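
The slicing idea in the highlights above can be illustrated with a small sketch. This is not the paper's implementation: the function names (`tile_bounds`, `intersect`, `local_multiplies`) are hypothetical, the tiles are assumed to be plain rectangular blocks, and replication factors and the one-sided GPU communication are not modeled at all. It only shows how interval intersection over tile boundaries can produce a list of local multiplies when A, B, and C are partitioned differently.

```python
from itertools import product


def tile_bounds(extent, tile):
    """Split [0, extent) into half-open intervals of length `tile` (the last may be shorter)."""
    return [(start, min(start + tile, extent)) for start in range(0, extent, tile)]


def intersect(a, b):
    """Return the intersection of two half-open intervals, or None if it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def local_multiplies(M, N, K, a_tile, b_tile, c_tile):
    """Enumerate sliced local multiplies for C = A @ B (hypothetical helper).

    A is M x K, B is K x N, C is M x N. Each *_tile is a (rows, cols) tile shape.
    Every returned entry (m, n, k) is a triple of half-open index ranges meaning
        C[m, n] += A[m, k] @ B[k, n]
    """
    a_rows, a_cols = tile_bounds(M, a_tile[0]), tile_bounds(K, a_tile[1])
    b_rows, b_cols = tile_bounds(K, b_tile[0]), tile_bounds(N, b_tile[1])
    c_rows, c_cols = tile_bounds(M, c_tile[0]), tile_bounds(N, c_tile[1])

    ops = []
    for cr, cc, ar, ac, br, bc in product(c_rows, c_cols, a_rows, a_cols, b_rows, b_cols):
        m = intersect(cr, ar)  # rows shared by the C tile and the A tile
        n = intersect(cc, bc)  # columns shared by the C tile and the B tile
        k = intersect(ac, br)  # inner dimension shared by the A tile and the B tile
        if m is not None and n is not None and k is not None:
            ops.append((m, n, k))
    return ops


if __name__ == "__main__":
    # Mismatched tile shapes for A, B, and C on a 6 x 7 times 7 x 5 product.
    for m, n, k in local_multiplies(6, 5, 7, (4, 3), (2, 5), (3, 2)):
        print(f"C[{m[0]}:{m[1]}, {n[0]}:{n[1]}] += "
              f"A[{m[0]}:{m[1]}, {k[0]}:{k[1]}] @ B[{k[0]}:{k[1]}, {n[0]}:{n[1]}]")
```

Because the tile boundaries along each dimension form partitions, every (i, j, k) index triple falls into exactly one emitted slice, so the slices together cover the full multiply without double counting. The brute-force loop over all tile combinations is chosen for clarity only; a practical version would compute the overlapping tile indices directly from the tile sizes.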