Surprising Effectiveness of Masking Updates in Adaptive Optimizers

5 days ago
  • #Optimization
  • #Machine Learning
  • #Large Language Models
  • Masking parameter updates in adaptive optimizers can be highly effective.
  • A masked variant of RMSProp outperforms recent state-of-the-art optimizers (a minimal sketch of a masked RMSProp step follows this list).
  • Random masking induces curvature-dependent geometric regularization, smoothing the optimization trajectory.
  • Momentum-aligned gradient masking (Magma) is introduced as a simple drop-in replacement for existing adaptive optimizers (see the second sketch below).
  • Magma shows consistent gains in LLM pre-training with negligible computational overhead.
  • At the 1B-parameter scale, Magma reduces perplexity by over 19% relative to Adam and by 9% relative to Muon.
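
To make the masking idea concrete, here is a minimal PyTorch-style sketch of a randomly masked RMSProp step. The keep probability, the point at which the mask is applied, and all names are illustrative assumptions, not the paper's exact variant:

```python
import torch

@torch.no_grad()
def masked_rmsprop_step(params, sq_avgs, lr=1e-3, alpha=0.99,
                        eps=1e-8, keep_prob=0.5):
    """One randomly masked RMSProp step (illustrative sketch only).

    `sq_avgs` holds one running mean of squared gradients per parameter.
    Each coordinate's update survives with probability `keep_prob`;
    masked coordinates are left untouched this step.
    """
    for p, v in zip(params, sq_avgs):
        if p.grad is None:
            continue
        g = p.grad
        # EMA of squared gradients, as in plain RMSProp.
        v.mul_(alpha).addcmul_(g, g, value=1 - alpha)
        # Bernoulli mask: randomly drop a subset of coordinate updates.
        mask = (torch.rand_like(p) < keep_prob).to(g.dtype)
        # Preconditioned update applied only on unmasked coordinates.
        p.addcdiv_(g * mask, v.sqrt().add_(eps), value=-lr)
```

Whether the second-moment statistics are also updated for masked coordinates is a design choice this summary leaves open; the sketch above updates them unconditionally.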
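Magma's exact masking rule is not given in this brief; the name suggests keeping only coordinates where the fresh gradient agrees in sign with the running momentum. Below is a hedged sketch of such an optimizer built on torch.optim.Optimizer. The agreement-based mask, the hyperparameter defaults, and the RMSProp-style preconditioning are all assumptions for illustration, not the authors' published algorithm:

```python
import torch

class MagmaSketch(torch.optim.Optimizer):
    """Hypothetical momentum-aligned masking on an RMSProp-style base.

    ASSUMPTION: the mask keeps a coordinate only when the current
    gradient's sign agrees with the momentum's sign. This is a guess
    based on the name "Magma", not the paper's published update rule.
    """

    def __init__(self, params, lr=1e-3, beta=0.9, alpha=0.99, eps=1e-8):
        defaults = dict(lr=lr, beta=beta, alpha=alpha, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta = group["lr"], group["beta"]
            alpha, eps = group["alpha"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if not state:
                    state["momentum"] = torch.zeros_like(p)
                    state["sq_avg"] = torch.zeros_like(p)
                m, v = state["momentum"], state["sq_avg"]
                # EMA of gradients (momentum) and of squared gradients.
                m.mul_(beta).add_(g, alpha=1 - beta)
                v.mul_(alpha).addcmul_(g, g, value=1 - alpha)
                # Keep only coordinates where gradient and momentum
                # agree in sign (the assumed "momentum-aligned" mask).
                mask = (g * m > 0).to(g.dtype)
                # Masked, RMSProp-preconditioned update.
                p.addcdiv_(m * mask, v.sqrt().add_(eps), value=-lr)
        return loss
```

It drops in like any torch optimizer: construct `MagmaSketch(model.parameters(), lr=3e-4)` and run the usual `loss.backward()`, `opt.step()`, `opt.zero_grad()` loop, which is consistent with the summary's claim of negligible computational overhead, since the mask is a single elementwise comparison per step.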