Hasty Briefs (beta)

A Simpler Parametrization for Modern Optimizers

  • #normalization
  • #optimization
  • #machine-learning
  • Modern normalized optimizers can be formulated as stochastic optimization on a product of RMS spheres, where each block's radius is fixed by initialization.
  • Weight decay emerges naturally as a radial Lagrange multiplier that maintains the RMS constraint, eliminating the need to schedule it manually as a penalty term.
  • A single global direction half-life parameter (h) governs weight-direction retention, replacing traditional learning rates, weight decay schedules, and clipping thresholds.
  • Spherical updates preserve block radii automatically via a retention-based angular step, with the step size derived from the half-life and the step count.
  • Momentum retention can be tied to weight-direction retention by default, simplifying the algorithm, though it can be untied if needed for different timescales.
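The bullets above can be sketched as a toy optimizer. This is a hypothetical reconstruction, not the article's exact algorithm: the class name `SphericalOpt`, the `0.5**(1/h)` retention formula, and the blend-then-retract form of the angular step are all assumptions made for illustration.

```python
import numpy as np

def rms(w):
    """Root-mean-square norm of a weight block."""
    return float(np.sqrt(np.mean(w * w)))

def halflife_retention(h):
    """Per-step retention r so the old direction's weight halves every h steps
    (assumed mapping: r**h == 0.5)."""
    return 0.5 ** (1.0 / h)

class SphericalOpt:
    """Toy optimizer on a product of RMS spheres, one sphere per block."""

    def __init__(self, params, h=100.0):
        self.r = halflife_retention(h)          # single global half-life knob
        self.radii = [rms(p) for p in params]   # each block's radius fixed at init
        self.moms = [np.zeros_like(p) for p in params]

    def step(self, params, grads):
        for i, (p, g) in enumerate(zip(params, grads)):
            # Momentum retention tied to direction retention by default.
            self.moms[i] = self.r * self.moms[i] + (1.0 - self.r) * g
            u = self.moms[i]
            un = np.linalg.norm(u)
            if un == 0.0:
                continue
            # Retention-based angular step: keep fraction r of the old
            # direction, move fraction (1 - r) against the momentum
            # direction, then retract to the block's RMS sphere. The
            # retraction plays the role of weight decay as a radial
            # Lagrange multiplier: any radial drift is projected away.
            d = self.r * p / np.linalg.norm(p) - (1.0 - self.r) * u / un
            p[...] = d / np.linalg.norm(d) * (self.radii[i] * np.sqrt(p.size))
```

Because every update ends with a retraction, each block's RMS radius stays at its initialization value to floating-point precision, with no separate weight-decay coefficient or clipping threshold to tune.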