Hasty Briefs

Towards Greater Leverage: Scaling Laws for Efficient MoE Language Models

5 days ago
  • #Scaling Laws
  • #Mixture-of-Experts
  • #Large Language Models
  • The Mixture-of-Experts (MoE) architecture efficiently scales Large Language Models (LLMs) by decoupling parameter count from computational cost.
  • An Efficiency Leverage (EL) metric is introduced to quantify the computational advantage of MoE models over their dense equivalents.
  • A large-scale empirical study of over 300 models up to 28B parameters shows that EL is driven primarily by the expert activation ratio and the compute budget, both following predictable power laws.
  • Expert granularity acts as a non-linear modulator with an optimal range.
  • A unified scaling law accurately predicts EL from a given MoE configuration (see the sketch after this list).
  • Ling-mini-beta, a 0.85B-active-parameter MoE model, matches the performance of a 6.1B dense model while using about 7x less compute.
  • The scaling laws are validated empirically, providing a foundation for scaling MoE models efficiently.
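
As a rough illustration of the quantities summarized above, the Python sketch below evaluates a generic power-law form for Efficiency Leverage and sanity-checks the 7x compute comparison. The functional form and the values of `a`, `alpha`, and `beta` are placeholder assumptions for illustration only, not the paper's fitted scaling law.

```python
# Illustrative sketch only: the power-law form mirrors the kind of relationship
# the summary describes, but the constant `a` and exponents `alpha`, `beta`
# are hypothetical placeholders, not the paper's fitted values.

def efficiency_leverage(activation_ratio: float, compute_flops: float,
                        a: float = 1.0, alpha: float = -0.4, beta: float = 0.02) -> float:
    """Assumed form: EL = a * r^alpha * C^beta, where r is the expert
    activation ratio (fraction of parameters active per token) and C is
    the training compute budget in FLOPs."""
    return a * (activation_ratio ** alpha) * (compute_flops ** beta)

# Under this assumed form, a sparser MoE (smaller activation ratio) yields
# greater leverage over its dense equivalent at the same compute budget.
print(efficiency_leverage(activation_ratio=0.05, compute_flops=1e21))
print(efficiency_leverage(activation_ratio=0.50, compute_flops=1e21))

# Back-of-the-envelope check of the Ling-mini-beta claim from the summary:
# if per-token training FLOPs scale roughly with active parameter count,
# a 0.85B-active MoE vs. a 6.1B dense model gives about a 7x compute ratio.
dense_params_b = 6.1
active_params_b = 0.85
print(f"approx. compute ratio: {dense_params_b / active_params_b:.1f}x")  # ~7.2x
```

The 7x figure in the summary is consistent with the ratio of active parameters (6.1B / 0.85B ≈ 7.2), under the common approximation that per-token training FLOPs scale with the number of active parameters.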