Surprising Effectiveness of Masking Updates in Adaptive Optimizers

5 days ago
  • #Optimization
  • #Machine Learning
  • #Large Language Models
  • Masking parameter updates in adaptive optimizers can be highly effective.
  • A masked variant of RMSProp outperforms recent state-of-the-art optimizers (a minimal sketch of a masked RMSProp step follows this list).
  • Random masking induces curvature-dependent geometric regularization, smoothing the optimization trajectory.
  • Momentum-aligned gradient masking (Magma) is introduced as a simple drop-in replacement for existing adaptive optimizers (see the second sketch below).
  • Magma shows consistent gains in LLM pre-training with negligible computational overhead.
  • At the 1B-parameter scale, Magma reduces perplexity by over 19% relative to Adam and by 9% relative to Muon.
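
To make the masking idea concrete, here is a minimal PyTorch-style sketch of a randomly masked RMSProp step. The keep probability, the point at which the mask is applied, and all names are illustrative assumptions, not the paper's exact variant:

```python
import torch

@torch.no_grad()
def masked_rmsprop_step(params, sq_avgs, lr=1e-3, alpha=0.99,
                        eps=1e-8, keep_prob=0.5):
    """One randomly masked RMSProp step (illustrative sketch only).

    `sq_avgs` holds one running mean of squared gradients per parameter.
    Each coordinate's update survives with probability `keep_prob`;
    masked coordinates are left untouched this step.
    """
    for p, v in zip(params, sq_avgs):
        if p.grad is None:
            continue
        g = p.grad
        # EMA of squared gradients, as in plain RMSProp.
        v.mul_(alpha).addcmul_(g, g, value=1 - alpha)
        # Bernoulli mask: randomly drop a subset of coordinate updates.
        mask = (torch.rand_like(p) < keep_prob).to(g.dtype)
        # Preconditioned update applied only on unmasked coordinates.
        p.addcdiv_(g * mask, v.sqrt().add_(eps), value=-lr)
```

Whether the second-moment statistics are also updated for masked coordinates is a design choice this summary leaves open; the sketch above updates them unconditionally.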
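Magma's exact masking rule is not given in this brief; the name suggests keeping only coordinates where the fresh gradient agrees in sign with the running momentum. Below is a hedged sketch of such an optimizer built on torch.optim.Optimizer. The agreement-based mask, the hyperparameter defaults, and the RMSProp-style preconditioning are all assumptions for illustration, not the authors' published algorithm:

```python
import torch

class MagmaSketch(torch.optim.Optimizer):
    """Hypothetical momentum-aligned masking on an RMSProp-style base.

    ASSUMPTION: the mask keeps a coordinate only when the current
    gradient's sign agrees with the momentum's sign. This is a guess
    based on the name "Magma", not the paper's published update rule.
    """

    def __init__(self, params, lr=1e-3, beta=0.9, alpha=0.99, eps=1e-8):
        defaults = dict(lr=lr, beta=beta, alpha=alpha, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta = group["lr"], group["beta"]
            alpha, eps = group["alpha"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if not state:
                    state["momentum"] = torch.zeros_like(p)
                    state["sq_avg"] = torch.zeros_like(p)
                m, v = state["momentum"], state["sq_avg"]
                # EMA of gradients (momentum) and of squared gradients.
                m.mul_(beta).add_(g, alpha=1 - beta)
                v.mul_(alpha).addcmul_(g, g, value=1 - alpha)
                # Keep only coordinates where gradient and momentum
                # agree in sign (the assumed "momentum-aligned" mask).
                mask = (g * m > 0).to(g.dtype)
                # Masked, RMSProp-preconditioned update.
                p.addcdiv_(m * mask, v.sqrt().add_(eps), value=-lr)
        return loss
```

It drops in like any torch optimizer: construct `MagmaSketch(model.parameters(), lr=3e-4)` and run the usual `loss.backward()`, `opt.step()`, `opt.zero_grad()` loop, which is consistent with the summary's claim of negligible computational overhead, since the mask is a single elementwise comparison per step.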