Towards Greater Leverage: Scaling Laws for Efficient MoE Language Models
- #Scaling Laws
- #Mixture-of-Experts
- #Large Language Models
- The Mixture-of-Experts (MoE) architecture scales Large Language Models (LLMs) efficiently by decoupling total parameter count from per-token computational cost.
- An Efficiency Leverage (EL) metric is introduced to quantify the computational advantage of an MoE model over a dense equivalent (formalized in the sketch after this list).
- A large-scale empirical study training over 300 models of up to 28B parameters shows that EL is driven primarily by the expert activation ratio and the total compute budget, both following predictable power laws (see the illustrative fit below).
- Expert granularity acts as a non-linear modulator with an optimal range.
- A unified scaling law accurately predicts EL from an MoE model's configuration.
- Ling-mini-beta, an MoE model with only 0.85B active parameters, matches the performance of a 6.1B dense model while consuming over 7x less training compute (a back-of-the-envelope check follows this list).
- Empirical results validate the scaling laws, providing a principled foundation for scaling efficient MoE models.
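One concrete way to read the EL metric, consistent with the bullets above (the notation is an illustrative formalization, not necessarily the paper's exact definition): for an MoE model M trained with compute budget C, EL is the compute a dense model would need to reach the same loss, divided by C; EL > 1 means the MoE reaches that loss more cheaply.

```latex
% Efficiency Leverage of an MoE model M trained with compute budget C
% (illustrative notation: C_dense(L) is the compute a dense model needs
%  to reach loss L, and L_MoE(M, C) is the loss M reaches with compute C)
\mathrm{EL}(M, C) \;=\; \frac{C_{\mathrm{dense}}\!\left(L_{\mathrm{MoE}}(M, C)\right)}{C}
```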
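A minimal sketch of the power-law behaviour described above, assuming a generic form EL ≈ a · A^(−α) · C^γ in the activation ratio A and compute budget C. The functional form, coefficients, and data below are synthetic illustrations, not the paper's fitted law (which additionally modulates EL by expert granularity).

```python
import numpy as np
from scipy.optimize import curve_fit

def el_power_law(X, a, alpha, gamma):
    """Illustrative power-law form: EL = a * A**(-alpha) * C**gamma, where
    A is the expert activation ratio and C the training compute budget.
    The paper's law also depends on expert granularity; omitted here."""
    A, C = X
    return a * A ** (-alpha) * C ** gamma

rng = np.random.default_rng(0)

# Synthetic configurations (NOT the paper's data): activation ratios and
# compute budgets in FLOPs, with EL generated from the assumed form above.
A = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125])
C = np.array([1e20, 1e20, 1e20, 1e21, 1e21, 1e22])
true_params = (1.2, 0.45, 0.005)
EL = el_power_law((A, C), *true_params) * rng.normal(1.0, 0.02, A.size)

# Fit the power law and recover the (synthetic) exponents.
params, _ = curve_fit(el_power_law, (A, C), EL, p0=(1.0, 0.5, 0.01))
a, alpha, gamma = params
print(f"recovered fit: EL ≈ {a:.3g} * A^(-{alpha:.2f}) * C^{gamma:.4f}")
```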
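A back-of-the-envelope check of the ~7x figure, assuming the common C ≈ 6·N·D training-FLOPs approximation (an assumption here, not taken from the paper) and equal training tokens: the saving then reduces to the ratio of dense parameters to active parameters.

```python
# Rough compute-ratio check under the C ≈ 6 * N * D approximation,
# where N counts parameters active per token and D counts training tokens.
N_DENSE = 6.1e9          # dense baseline parameters
N_MOE_ACTIVE = 0.85e9    # Ling-mini-beta active parameters
TOKENS = 1e12            # hypothetical token count; cancels out of the ratio

flops_dense = 6 * N_DENSE * TOKENS
flops_moe = 6 * N_MOE_ACTIVE * TOKENS

print(f"compute ratio ≈ {flops_dense / flops_moe:.1f}x")  # ≈ 7.2x
```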