Hasty Briefs

Modern Optimizers – An Alchemist's Notes on Deep Learning

16 days ago
  • #gradient-descent
  • #machine-learning
  • #optimization
  • Optimizers like Adam are the backbone of modern deep learning, but spectral-whitening methods claim to outperform Adam.
  • Gradient descent traditionally uses a Euclidean distance metric, but non-Euclidean metrics can improve optimization by accounting for parameter sensitivities.
  • The whitening metric is derived from the square root of the Gauss-Newton matrix, giving a conservative estimate of local curvature for preconditioning updates (see the first sketch after this list).
  • Natural gradient descent uses the Fisher information matrix, which is related to the whitening metric and ensures parameterization-invariant optimization.
  • Spectral-norm descent relates to the whitening metric by projecting gradients onto the nearest (semi-)orthogonal matrix, i.e., steepest descent measured by the maximum singular value (see the orthogonalization sketch below).
  • Optimizers such as Adam/RMSProp, Shampoo/SOAP/SPlus, PSGD, and Muon implement spectral-whitening updates at varying levels of fidelity and computational cost (Adam/RMSProp use the diagonal approximation sketched below).
  • Benchmarking shows that spectral-whitening optimizers like SOAP and Muon outperform Adam in terms of validation loss and steps-to-Adam ratio.
  • SOAP is the most effective per gradient step, while Muon is more computationally efficient, suggesting a hybrid approach could be optimal.
  • No current method reliably surpasses spectral-whitening optimizers, indicating that further innovation in optimization techniques is still needed.
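
The relationship between the Euclidean, natural-gradient, and whitening updates can be made concrete on a toy problem. Below is a minimal NumPy sketch, not taken from the post: on a quadratic loss the Gauss-Newton/Fisher matrix coincides with the Hessian `H`, so the three updates differ only in the preconditioner applied to the gradient. Variable names are illustrative.

```python
# Toy comparison of three update rules on L(w) = 0.5 * w^T H w
# with an ill-conditioned H. Not the post's code; a sketch only.
import numpy as np

H = np.diag([100.0, 1.0])          # ill-conditioned curvature
w = np.array([1.0, 1.0])

def grad(w):
    return H @ w                   # dL/dw for the quadratic loss

# For this toy loss the Gauss-Newton / Fisher matrix equals H;
# in practice it would be estimated from per-example gradients.
F = H
F_inv = np.linalg.inv(F)

# Matrix inverse square root via eigendecomposition: F^{-1/2}.
evals, evecs = np.linalg.eigh(F)
F_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

lr = 0.01
g = grad(w)

step_euclidean = -lr * g                 # plain gradient descent
step_natural   = -lr * F_inv @ g         # natural gradient: F^{-1} g
step_whitened  = -lr * F_inv_sqrt @ g    # whitening metric: F^{-1/2} g

print("Euclidean:", step_euclidean)
print("Natural:  ", step_natural)
print("Whitened: ", step_whitened)
```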
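
Next, a minimal sketch of the orthogonalized-gradient step behind spectral-norm descent, the idea Muon builds on. The full SVD is shown here for clarity; practical implementations approximate `U @ Vt` with a few Newton-Schulz iterations instead. The code and names are illustrative, not the post's.

```python
# Spectral-norm steepest descent: replace the gradient matrix with its
# projection onto the nearest (semi-)orthogonal matrix. Sketch only.
import numpy as np

def orthogonalize(G):
    """Set all singular values of G to 1, keeping its singular vectors."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # a weight matrix
G = rng.normal(size=(4, 3))        # its (stand-in) gradient

lr = 0.02
W -= lr * orthogonalize(G)         # spectral-norm descent step

# All singular values of the update direction are 1:
print(np.linalg.svd(orthogonalize(G), compute_uv=False))
```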
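
Finally, a minimal sketch of the diagonal approximation used by RMSProp/Adam, which replaces the full inverse square root with an elementwise running second-moment estimate. Hyperparameter values are the usual defaults; the function and loss are illustrative assumptions, not the post's code.

```python
# Adam-style update: diagonal (elementwise) whitening of the gradient.
import numpy as np

def adam_like_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g * g      # diagonal second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # elementwise whitening
    return w, m, v

w = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
for t in range(1, 101):
    g = 2.0 * w + np.array([1.0, -1.0, 0.5])      # gradient of a toy quadratic
    w, m, v = adam_like_step(w, g, m, v, t)
print(w)
```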