Towards Greater Leverage: Scaling Laws for Efficient MoE Language Models
- #Scaling Laws
- #Mixture-of-Experts
- #Large Language Models
- The Mixture-of-Experts (MoE) architecture scales Large Language Models (LLMs) efficiently by decoupling total parameter count from per-token computational cost.
- An Efficiency Leverage (EL) metric is introduced to quantify the computational advantage of an MoE model over a dense equivalent (formalized in the sketch after this list).
- A large-scale empirical study training over 300 models of up to 28B parameters shows that EL is driven primarily by the expert activation ratio and the total compute budget, both following predictable power laws (see the illustrative fit below).
- Expert granularity acts as a non-linear modulator with an optimal range.
- A unified scaling law accurately predicts EL from an MoE model's configuration.
- Ling-mini-beta, an MoE model with only 0.85B active parameters, matches the performance of a 6.1B dense model while consuming over 7x less training compute (a back-of-the-envelope check follows this list).
- Empirical results validate the scaling laws, providing a principled foundation for scaling efficient MoE models.
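One concrete way to read the EL metric, consistent with the bullets above (the notation is an illustrative formalization, not necessarily the paper's exact definition): for an MoE model M trained with compute budget C, EL is the compute a dense model would need to reach the same loss, divided by C; EL > 1 means the MoE reaches that loss more cheaply.

```latex
% Efficiency Leverage of an MoE model M trained with compute budget C
% (illustrative notation: C_dense(L) is the compute a dense model needs
%  to reach loss L, and L_MoE(M, C) is the loss M reaches with compute C)
\mathrm{EL}(M, C) \;=\; \frac{C_{\mathrm{dense}}\!\left(L_{\mathrm{MoE}}(M, C)\right)}{C}
```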
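A minimal sketch of the power-law behaviour described above, assuming a generic form EL ≈ a · A^(−α) · C^γ in the activation ratio A and compute budget C. The functional form, coefficients, and data below are synthetic illustrations, not the paper's fitted law (which additionally modulates EL by expert granularity).

```python
import numpy as np
from scipy.optimize import curve_fit

def el_power_law(X, a, alpha, gamma):
    """Illustrative power-law form: EL = a * A**(-alpha) * C**gamma, where
    A is the expert activation ratio and C the training compute budget.
    The paper's law also depends on expert granularity; omitted here."""
    A, C = X
    return a * A ** (-alpha) * C ** gamma

rng = np.random.default_rng(0)

# Synthetic configurations (NOT the paper's data): activation ratios and
# compute budgets in FLOPs, with EL generated from the assumed form above.
A = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125])
C = np.array([1e20, 1e20, 1e20, 1e21, 1e21, 1e22])
true_params = (1.2, 0.45, 0.005)
EL = el_power_law((A, C), *true_params) * rng.normal(1.0, 0.02, A.size)

# Fit the power law and recover the (synthetic) exponents.
params, _ = curve_fit(el_power_law, (A, C), EL, p0=(1.0, 0.5, 0.01))
a, alpha, gamma = params
print(f"recovered fit: EL ≈ {a:.3g} * A^(-{alpha:.2f}) * C^{gamma:.4f}")
```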
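A back-of-the-envelope check of the ~7x figure, assuming the common C ≈ 6·N·D training-FLOPs approximation (an assumption here, not taken from the paper) and equal training tokens: the saving then reduces to the ratio of dense parameters to active parameters.

```python
# Rough compute-ratio check under the C ≈ 6 * N * D approximation,
# where N counts parameters active per token and D counts training tokens.
N_DENSE = 6.1e9          # dense baseline parameters
N_MOE_ACTIVE = 0.85e9    # Ling-mini-beta active parameters
TOKENS = 1e12            # hypothetical token count; cancels out of the ratio

flops_dense = 6 * N_DENSE * TOKENS
flops_moe = 6 * N_MOE_ACTIVE * TOKENS

print(f"compute ratio ≈ {flops_dense / flops_moe:.1f}x")  # ≈ 7.2x
```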