LPLB: An early-research-stage MoE load balancer based on linear programming
- #linear programming
- #Mixture-of-Experts
- #load balancing
- LPLB is a parallel load balancer using linear programming to optimize workload distribution for MoE models.
- It dynamically reorders experts, constructs replicas, and solves for optimal token assignments to balance load at runtime.
- Expert reordering is handled by EPLB; workload statistics can be provided by the user or collected via torch.distributed (see the collection sketch after this list).
- LPLB implements a single-SM Interior-Point Method (IPM) as its LP solver, using NVIDIA's cuSolverDx and cuBLASDx libraries.
- Prerequisites include CUDA Toolkit >= 12.6.3 and DeepEP (recommended); EPLB comes embedded.
- LPLB extends EPLB to handle dynamic load imbalance in MoE training, focusing on per-batch fluctuations.
- Each redundant expert is linked to its original expert by an edge whose capacity is determined by the current token assignments.
- The LP then redistributes tokens along these edges to minimize load imbalance within an expert-parallel group (see the LP sketch after this list).
- Workload synchronization uses NVLink and NVSHMEM to reduce communication overhead.
- Current limitations include balancing token count only, solver latency, and potentially suboptimal performance under extreme imbalance.
- Topologies include Cube, Hypercube, and Torus, each suited for different GPU configurations.
- Custom topologies can be explored by modifying the r2o matrix (an illustrative construction follows below).
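
Workload collection, as mentioned in the bullet on EPLB and torch.distributed, can be done with a standard collective. The sketch below is a minimal illustration assuming an already initialized process group; the function name `gather_expert_loads` and the tensor layout are placeholders, not LPLB's actual API.

```python
import torch
import torch.distributed as dist

def gather_expert_loads(local_counts: torch.Tensor, group=None) -> torch.Tensor:
    """Gather per-expert token counts from every rank in the group.

    local_counts: int64 tensor of shape [num_local_experts] with the number
    of tokens routed to each local expert in the current batch.
    Returns a [world_size, num_local_experts] tensor on every rank.
    """
    world_size = dist.get_world_size(group)
    gathered = [torch.empty_like(local_counts) for _ in range(world_size)]
    dist.all_gather(gathered, local_counts, group=group)
    return torch.stack(gathered, dim=0)
```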
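
To make the LP step concrete, here is a CPU-side sketch of a per-batch rebalancing problem: flow variables on replica edges, bounded by edge capacities, with the maximum per-rank load minimized. The four-rank example, the edge list, and the use of scipy.optimize.linprog are illustrative assumptions; LPLB itself solves the LP on-GPU with its single-SM IPM.

```python
import numpy as np
from scipy.optimize import linprog

# Per-rank token loads within one expert-parallel group (example numbers).
loads = np.array([120.0, 80.0, 60.0, 140.0])

# Edges along which tokens may be shifted: (src_rank, dst_rank, capacity).
# Each edge stands for a redundant expert: dst hosts a replica of an expert
# that originally lives on src, so up to `capacity` tokens can move.
edges = [(0, 1, 50.0), (3, 2, 60.0), (3, 1, 30.0)]

n_ranks, n_edges = len(loads), len(edges)

# Decision variables x = [f_0, ..., f_{m-1}, t]:
#   f_e = tokens moved along edge e, t = maximum per-rank load (minimized).
c = np.zeros(n_edges + 1)
c[-1] = 1.0

# For every rank r: loads[r] - outflow(r) + inflow(r) <= t
A_ub = np.zeros((n_ranks, n_edges + 1))
for e, (src, dst, _) in enumerate(edges):
    A_ub[src, e] -= 1.0  # outflow lowers the source rank's load
    A_ub[dst, e] += 1.0  # inflow raises the destination rank's load
A_ub[:, -1] = -1.0       # move t to the left-hand side
b_ub = -loads

bounds = [(0.0, cap) for (_, _, cap) in edges] + [(0.0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("max per-rank load after rebalancing:", res.x[-1])
print("token flows along edges:", res.x[:-1])
```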
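
For the topology bullets, the sketch below shows one way a cube-style replica layout over 8 GPUs could be written down: each GPU is a vertex of a 2x2x2 cube and is paired with its three neighbors. The resulting edge list and the r2o-style table are assumptions for illustration only; the exact matrix layout LPLB expects may differ.

```python
import numpy as np

num_gpus = 8  # one expert-parallel group arranged as a 2x2x2 cube

# Cube edges: two GPUs are neighbors if their ranks differ in one address bit.
edges = [(g, g ^ bit) for g in range(num_gpus)
         for bit in (1, 2, 4) if g < (g ^ bit)]

# One possible replica-to-original table: r2o[g, k] is the GPU whose expert
# the k-th redundant slot on GPU g replicates (here, its k-th cube neighbor).
r2o = np.array([[g ^ bit for bit in (1, 2, 4)] for g in range(num_gpus)])

print(len(edges), "cube edges")  # 12 edges for 8 GPUs
print(r2o)
```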