A Simpler Parametrization for Modern Optimizers
- #normalization
- #optimization
- #machine-learning
- Modern normalized optimizers can be formulated as stochastic optimization on a product of RMS spheres, where each block's radius is fixed by initialization.
- Weight decay emerges naturally as a radial Lagrange multiplier that maintains the RMS constraint, rather than as a manually scheduled penalty term.
- A single global direction half-life parameter (h) governs weight-direction retention, replacing traditional learning rates, weight decay schedules, and clipping thresholds.
- Spherical updates preserve each block's radius automatically via a retention-based angular step, with the step size derived from the half-life and the step count.
- Momentum retention can be tied to weight-direction retention by default, simplifying the algorithm, though it can be untied if needed for different timescales.
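The half-life parametrization can be made concrete: if a fraction r of the weight direction is retained per step, then after h steps the retained fraction should be 1/2, so r = 2^(-1/h). A minimal sketch of that conversion (the function name is mine, not from the source):

```python
def retention_from_half_life(h: float) -> float:
    """Per-step retention factor r such that r**h == 1/2.

    h is the direction half-life measured in optimizer steps.
    """
    return 0.5 ** (1.0 / h)

# After h steps, the retained fraction of the original direction is 1/2.
r = retention_from_half_life(1000.0)
print(round(r ** 1000.0, 6))  # → 0.5
```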
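Putting the pieces together, one step of such an optimizer might look like the sketch below: tie the momentum retention to the direction retention, project the momentum onto the tangent space of the RMS sphere (dropping the radial component, which is what the Lagrange-multiplier view of weight decay removes), take a retention-weighted angular step, and renormalize to the fixed radius. This is an illustrative reconstruction under my own assumptions, not the authors' exact algorithm; all names and the specific update rule here are mine.

```python
import numpy as np

def rms(x):
    # Root-mean-square norm of a weight block.
    return np.sqrt(np.mean(x * x))

def spherical_step(w, grad, m, h, radius):
    """One hedged optimizer step on an RMS sphere (illustrative sketch).

    w      : block weights, constrained to rms(w) == radius
    grad   : loss gradient for the block
    m      : momentum buffer, same shape as w
    h      : direction half-life in steps; per-step retention r = 2**(-1/h)
    radius : the block's RMS radius, fixed at initialization
    """
    r = 0.5 ** (1.0 / h)  # per-step retention derived from the half-life

    # Momentum retention tied to weight-direction retention by default.
    m = r * m + (1.0 - r) * grad

    # Project momentum onto the tangent space of the sphere at w: the
    # removed radial component plays the role of the weight-decay
    # Lagrange multiplier that keeps rms(w) on the sphere.
    radial = np.sum(m * w) / np.sum(w * w)
    tangent = m - radial * w
    t_rms = rms(tangent)
    if t_rms == 0.0:
        return w, m  # no tangential signal; stay put
    u = tangent / t_rms  # unit-RMS tangent direction

    # Retention-based angular step: keep a fraction r of the old
    # direction, blend in the descent direction, renormalize to the
    # fixed radius so the constraint holds exactly.
    w_new = r * w - (1.0 - r) * radius * u
    w_new *= radius / rms(w_new)
    return w_new, m
```

Note the renormalization at the end makes radius preservation exact regardless of floating-point drift, which is what lets the radius stay "fixed by initialization" with no extra bookkeeping.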