A Simpler Parametrization for Modern Optimizers
- #normalization
- #optimization
- #machine-learning
- Modern normalized optimizers can be formulated as stochastic optimization on a product of RMS spheres, where each block's radius is fixed by initialization.
- Weight decay emerges naturally as a radial Lagrange multiplier that maintains the RMS constraint, rather than as a manually scheduled penalty term.
- A single global direction half-life parameter (h) governs weight-direction retention, replacing traditional learning rates, weight decay schedules, and clipping thresholds.
- Spherical updates preserve each block's radius automatically via a retention-based angular step, with the step size derived from the half-life and the step count.
- Momentum retention can be tied to weight-direction retention by default, simplifying the algorithm, though it can be untied if needed for different timescales.
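The half-life parametrization can be made concrete: if a fraction r of the weight direction is retained per step, then after h steps the retained fraction should be 1/2, so r = 2^(-1/h). A minimal sketch of that conversion (the function name is mine, not from the source):

```python
def retention_from_half_life(h: float) -> float:
    """Per-step retention factor r such that r**h == 1/2.

    h is the direction half-life measured in optimizer steps.
    """
    return 0.5 ** (1.0 / h)

# After h steps, the retained fraction of the original direction is 1/2.
r = retention_from_half_life(1000.0)
print(round(r ** 1000.0, 6))  # → 0.5
```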
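Putting the pieces together, one step of such an optimizer might look like the sketch below: tie the momentum retention to the direction retention, project the momentum onto the tangent space of the RMS sphere (dropping the radial component, which is what the Lagrange-multiplier view of weight decay removes), take a retention-weighted angular step, and renormalize to the fixed radius. This is an illustrative reconstruction under my own assumptions, not the authors' exact algorithm; all names and the specific update rule here are mine.

```python
import numpy as np

def rms(x):
    # Root-mean-square norm of a weight block.
    return np.sqrt(np.mean(x * x))

def spherical_step(w, grad, m, h, radius):
    """One hedged optimizer step on an RMS sphere (illustrative sketch).

    w      : block weights, constrained to rms(w) == radius
    grad   : loss gradient for the block
    m      : momentum buffer, same shape as w
    h      : direction half-life in steps; per-step retention r = 2**(-1/h)
    radius : the block's RMS radius, fixed at initialization
    """
    r = 0.5 ** (1.0 / h)  # per-step retention derived from the half-life

    # Momentum retention tied to weight-direction retention by default.
    m = r * m + (1.0 - r) * grad

    # Project momentum onto the tangent space of the sphere at w: the
    # removed radial component plays the role of the weight-decay
    # Lagrange multiplier that keeps rms(w) on the sphere.
    radial = np.sum(m * w) / np.sum(w * w)
    tangent = m - radial * w
    t_rms = rms(tangent)
    if t_rms == 0.0:
        return w, m  # no tangential signal; stay put
    u = tangent / t_rms  # unit-RMS tangent direction

    # Retention-based angular step: keep a fraction r of the old
    # direction, blend in the descent direction, renormalize to the
    # fixed radius so the constraint holds exactly.
    w_new = r * w - (1.0 - r) * radius * u
    w_new *= radius / rms(w_new)
    return w_new, m
```

Note the renormalization at the end makes radius preservation exact regardless of floating-point drift, which is what lets the radius stay "fixed by initialization" with no extra bookkeeping.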