
LPLB: An early research stage MoE load balancer based on linear programming

7 days ago
  • #linear programming
  • #Mixture-of-Experts
  • #load balancing
  • LPLB is a parallel load balancer using linear programming to optimize workload distribution for MoE models.
  • It reorders experts, constructs replicas, and solves for the optimal token assignment to balance load dynamically.
  • Expert reordering is handled by EPLB, and workload statistics can be supplied by the user or collected via torch.distributed (see the first sketch after this list).
  • LPLB implements a single-SM Interior Point Method (IPM) and uses NVIDIA's cuSolverDx and cuBLASDx libraries.
  • Prerequisites include CUDA Toolkit >= 12.6.3 and DeepEP (recommended); EPLB comes embedded within LPLB.
  • LPLB extends EPLB to handle dynamic load imbalance in MoE training, focusing on per-batch fluctuations.
  • Redundant experts are linked to their original experts by edges whose capacities are defined by the token assignments.
  • LP optimization then redistributes tokens along these edges to minimize load imbalance within an expert-parallel group (see the second sketch after this list).
  • Workload synchronization is optimized using NVLINK and NVSHMEM, reducing communication overhead.
  • Current limitations include balancing only the token count (not the actual per-expert compute cost), solver latency, and potentially suboptimal results under extreme imbalance.
  • Built-in topologies include Cube, Hypercube, and Torus, each suited to a different GPU count and interconnect layout.
  • Custom topologies can be explored by modifying the r2o matrix (see the third sketch after this list).
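As a first sketch, here is one way per-expert workload statistics could be gathered with torch.distributed, as mentioned above. The function and argument names are illustrative rather than LPLB's actual API; the idea is simply to all-reduce each rank's local token counts so every rank sees the global per-expert workload.

```python
# Illustrative only: gather per-expert token counts across an expert-parallel group.
import torch
import torch.distributed as dist

def collect_expert_workload(local_token_counts: torch.Tensor,
                            group: dist.ProcessGroup | None = None) -> torch.Tensor:
    """All-reduce per-expert token counts so every rank sees the global workload.

    local_token_counts: 1-D tensor of length num_experts holding the number of
    tokens this rank routed to each global expert in the current batch.
    """
    global_counts = local_token_counts.clone()
    dist.all_reduce(global_counts, op=dist.ReduceOp.SUM, group=group)
    return global_counts
```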
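The second sketch illustrates the kind of linear program described above: tokens may be shifted from an original expert's rank to the rank hosting its replica along an edge of bounded capacity, and the objective is to minimize the maximum rank load. The toy two-rank instance below uses SciPy for clarity; LPLB itself solves the LP on-GPU with its single-SM interior point method, and the variable layout here is an assumption, not its actual formulation.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: 2 ranks, one replica edge in each direction.
# edges[k] = (src_rank, dst_rank, capacity): tokens that may be shifted from
# an overloaded src_rank to the replica hosted on dst_rank.
edges = [(0, 1, 120.0), (1, 0, 80.0)]
base_load = np.array([300.0, 180.0])      # tokens initially routed to each rank
num_ranks, num_edges = len(base_load), len(edges)

# Variables: x_0..x_{E-1} (tokens shifted along each edge) and t (max rank load).
c = np.zeros(num_edges + 1)
c[-1] = 1.0                               # objective: minimize t

# Constraint per rank r: base_load[r] - tokens_out + tokens_in <= t
A_ub = np.zeros((num_ranks, num_edges + 1))
b_ub = -base_load                         # move base load to the right-hand side
for k, (src, dst, _) in enumerate(edges):
    A_ub[src, k] -= 1.0                   # tokens leaving src
    A_ub[dst, k] += 1.0                   # tokens arriving at dst
A_ub[:, -1] = -1.0                        # ... - t <= -base_load[r]

bounds = [(0.0, cap) for _, _, cap in edges] + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("tokens shifted per edge:", res.x[:-1])   # ~[60, 0]
print("resulting max rank load:", res.x[-1])    # ~240 (both ranks balanced)
```

With these numbers the solver moves about 60 tokens from rank 0 to rank 1, leaving both ranks at 240 tokens, well within the 120-token edge capacity.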
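Finally, a sketch of what a custom replica-to-original (r2o) mapping might look like. The layout assumed here (one row per rank, one column per redundant slot, entries giving the global index of the replicated original expert) and all constants are purely illustrative; consult LPLB's documentation for the format it actually expects.

```python
# Illustrative r2o layout for a ring-style ("torus"-like) replica topology.
import numpy as np

num_ranks = 8                # one expert-parallel group spanning 8 GPUs
experts_per_rank = 2         # original experts hosted on each rank
num_redundant_slots = 1      # redundant experts per rank

# Each rank replicates the first expert of the next rank, so tokens can spill
# over to a neighbor when that neighbor's expert is overloaded.
r2o = np.empty((num_ranks, num_redundant_slots), dtype=np.int64)
for rank in range(num_ranks):
    neighbor = (rank + 1) % num_ranks
    r2o[rank, 0] = neighbor * experts_per_rank   # neighbor's first expert

print(r2o.squeeze())         # [ 2  4  6  8 10 12 14  0]
```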