LPLB: An early-research-stage MoE load balancer based on linear programming
- #linear programming
- #Mixture-of-Experts
- #load balancing
- LPLB is a parallel load balancer using linear programming to optimize workload distribution for MoE models.
- It dynamically reorders experts, constructs replicas, and solves for optimal token assignments to balance load at runtime.
- Expert reordering is handled by EPLB; workload statistics can be provided by the user or collected via torch.distributed (see the collection sketch after this list).
- LPLB implements a single-SM Interior-Point Method (IPM) as its LP solver, using NVIDIA's cuSolverDx and cuBLASDx libraries.
- Prerequisites include CUDA Toolkit >= 12.6.3 and DeepEP (recommended); EPLB comes embedded.
- LPLB extends EPLB to handle dynamic load imbalance in MoE training, focusing on per-batch fluctuations.
- Each redundant expert is linked to its original expert by an edge whose capacity is determined by the current token assignments.
- The LP then redistributes tokens along these edges to minimize load imbalance within an expert-parallel group (see the LP sketch after this list).
- Workload synchronization uses NVLink and NVSHMEM to reduce communication overhead.
- Current limitations include balancing token count only, solver latency, and potentially suboptimal performance under extreme imbalance.
- Topologies include Cube, Hypercube, and Torus, each suited for different GPU configurations.
- Custom topologies can be explored by modifying the r2o matrix (an illustrative construction follows below).
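
Workload collection, as mentioned in the bullet on EPLB and torch.distributed, can be done with a standard collective. The sketch below is a minimal illustration assuming an already initialized process group; the function name `gather_expert_loads` and the tensor layout are placeholders, not LPLB's actual API.

```python
import torch
import torch.distributed as dist

def gather_expert_loads(local_counts: torch.Tensor, group=None) -> torch.Tensor:
    """Gather per-expert token counts from every rank in the group.

    local_counts: int64 tensor of shape [num_local_experts] with the number
    of tokens routed to each local expert in the current batch.
    Returns a [world_size, num_local_experts] tensor on every rank.
    """
    world_size = dist.get_world_size(group)
    gathered = [torch.empty_like(local_counts) for _ in range(world_size)]
    dist.all_gather(gathered, local_counts, group=group)
    return torch.stack(gathered, dim=0)
```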
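
To make the LP step concrete, here is a CPU-side sketch of a per-batch rebalancing problem: flow variables on replica edges, bounded by edge capacities, with the maximum per-rank load minimized. The four-rank example, the edge list, and the use of scipy.optimize.linprog are illustrative assumptions; LPLB itself solves the LP on-GPU with its single-SM IPM.

```python
import numpy as np
from scipy.optimize import linprog

# Per-rank token loads within one expert-parallel group (example numbers).
loads = np.array([120.0, 80.0, 60.0, 140.0])

# Edges along which tokens may be shifted: (src_rank, dst_rank, capacity).
# Each edge stands for a redundant expert: dst hosts a replica of an expert
# that originally lives on src, so up to `capacity` tokens can move.
edges = [(0, 1, 50.0), (3, 2, 60.0), (3, 1, 30.0)]

n_ranks, n_edges = len(loads), len(edges)

# Decision variables x = [f_0, ..., f_{m-1}, t]:
#   f_e = tokens moved along edge e, t = maximum per-rank load (minimized).
c = np.zeros(n_edges + 1)
c[-1] = 1.0

# For every rank r: loads[r] - outflow(r) + inflow(r) <= t
A_ub = np.zeros((n_ranks, n_edges + 1))
for e, (src, dst, _) in enumerate(edges):
    A_ub[src, e] -= 1.0  # outflow lowers the source rank's load
    A_ub[dst, e] += 1.0  # inflow raises the destination rank's load
A_ub[:, -1] = -1.0       # move t to the left-hand side
b_ub = -loads

bounds = [(0.0, cap) for (_, _, cap) in edges] + [(0.0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("max per-rank load after rebalancing:", res.x[-1])
print("token flows along edges:", res.x[:-1])
```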
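
For the topology bullets, the sketch below shows one way a cube-style replica layout over 8 GPUs could be written down: each GPU is a vertex of a 2x2x2 cube and is paired with its three neighbors. The resulting edge list and the r2o-style table are assumptions for illustration only; the exact matrix layout LPLB expects may differ.

```python
import numpy as np

num_gpus = 8  # one expert-parallel group arranged as a 2x2x2 cube

# Cube edges: two GPUs are neighbors if their ranks differ in one address bit.
edges = [(g, g ^ bit) for g in range(num_gpus)
         for bit in (1, 2, 4) if g < (g ^ bit)]

# One possible replica-to-original table: r2o[g, k] is the GPU whose expert
# the k-th redundant slot on GPU g replicates (here, its k-th cube neighbor).
r2o = np.array([[g ^ bit for bit in (1, 2, 4)] for g in range(num_gpus)])

print(len(edges), "cube edges")  # 12 edges for 8 GPUs
print(r2o)
```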