Surprising Effectiveness of Masking Updates in Adaptive Optimizers
5 days ago
- #Optimization
- #Machine Learning
- #Large Language Models
- Masking (zeroing out a subset of coordinates of) parameter updates in adaptive optimizers can be surprisingly effective.
- A masked variant of RMSProp outperforms recent state-of-the-art optimizers.
- Random masking induces curvature-dependent geometric regularization, smoothing the optimization trajectory.
- Momentum-aligned gradient masking (Magma) is introduced as a simple drop-in replacement for existing adaptive optimizers (see the sketch after this list).
- Magma shows consistent gains in LLM pre-training with negligible computational overhead.
- At the 1B-parameter scale, Magma reduces perplexity by over 19% relative to Adam and by 9% relative to Muon.
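
The post itself contains no code, so here is a minimal NumPy sketch of what a masked RMSProp-style step could look like. The function name, the `mask_prob` and `mode` parameters, and the sign-agreement rule used for the "aligned" mask are illustrative assumptions, not the paper's exact formulation; Magma's actual masking rule and hyperparameters may differ.

```python
import numpy as np

def masked_rmsprop_step(param, grad, state, lr=1e-3, beta2=0.999,
                        eps=1e-8, mask_prob=0.5, momentum_beta=0.9,
                        mode="random", rng=None):
    """One RMSProp-style update in which only a masked subset of
    coordinates is applied. `mode` selects the masking rule:
      - "random": keep each coordinate with probability `mask_prob`
        (the random-masking baseline described in the post).
      - "aligned": keep coordinates whose gradient sign agrees with the
        momentum sign -- an ASSUMED reading of "momentum-aligned
        gradient masking"; the paper's exact rule may differ.
    """
    rng = rng or np.random.default_rng()

    # Second-moment estimate, as in standard RMSProp.
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad**2
    # Momentum buffer, used here only to build the aligned mask.
    state["m"] = momentum_beta * state["m"] + (1.0 - momentum_beta) * grad

    if mode == "random":
        mask = (rng.random(grad.shape) < mask_prob).astype(grad.dtype)
    else:  # "aligned"
        mask = (np.sign(grad) == np.sign(state["m"])).astype(grad.dtype)

    update = lr * grad / (np.sqrt(state["v"]) + eps)
    return param - mask * update, state


# Toy usage on a quadratic objective, just to show the update runs end to end.
param = np.ones(4)
state = {"v": np.zeros(4), "m": np.zeros(4)}
for _ in range(100):
    grad = 2.0 * param  # gradient of ||param||^2
    param, state = masked_rmsprop_step(param, grad, state, mode="aligned")
print(param)
```

Because the mask only gates which coordinates of an otherwise standard adaptive update are applied, a rule like this can sit inside an existing optimizer loop with negligible extra compute, which is consistent with the drop-in framing above.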