Modern Optimizers – An Alchemist's Notes on Deep Learning
- #gradient-descent
- #machine-learning
- #optimization
- Optimizers like Adam are the backbone of modern deep learning, but a family of spectral-whitening methods claims to outperform it.
- Gradient descent traditionally measures update size with a Euclidean distance metric; non-Euclidean metrics that account for how sensitive the loss is to each parameter can yield much better update directions (see the first sketch after this list).
- The whitening metric is derived from the square root of the Gauss-Newton matrix, giving a more conservative preconditioner than the full Gauss-Newton inverse (second sketch below).
- Natural gradient descent preconditions with the Fisher information matrix, which is closely related to the whitening metric and makes the update invariant to reparameterization.
- Spectral-norm descent connects to the whitening metric: measuring updates by their largest singular value leads to projecting the gradient onto the nearest (semi-)orthogonal matrix (third sketch below).
- Optimizers such as Adam/RMSProp, Shampoo/SOAP/SPlus, PSGD, and Muon all approximate this spectral-whitening update, at very different computational costs (final sketch below).
- Benchmarking shows that spectral-whitening optimizers like SOAP and Muon outperform Adam in terms of validation loss and steps-to-Adam ratio.
- SOAP is the most effective per gradient step, while Muon is the most computationally efficient, suggesting a hybrid of the two could be optimal.
- No current method reliably surpasses the spectral-whitening optimizers, leaving further innovation in optimizer design as an open problem.
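
A minimal sketch of the non-Euclidean-metric bullet above: steepest descent under a metric M replaces the update `-lr * g` with `-lr * M^(-1) g`. The toy quadratic loss, the choice of metric, and all numbers below are hypothetical illustrations, not taken from the post.

```python
import numpy as np

# Toy quadratic loss f(x) = 0.5 * x^T A x with very different curvature per direction.
A = np.diag([100.0, 1.0])
grad = lambda x: A @ x

def steepest_descent(x, metric, lr, steps=100):
    """Steepest descent under `metric`: x <- x - lr * metric^{-1} @ grad(x)."""
    metric_inv = np.linalg.inv(metric)
    for _ in range(steps):
        x = x - lr * metric_inv @ grad(x)
    return x

x0 = np.array([1.0, 1.0])
# Euclidean metric (identity) = plain gradient descent; the step size is limited by
# the sharpest direction, so the flat direction converges slowly.
print(steepest_descent(x0, np.eye(2), lr=0.01))
# Curvature-aware metric (here simply M = A): both directions converge at the same rate.
print(steepest_descent(x0, A, lr=0.5))
```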
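
A second sketch, contrasting the natural-gradient and whitening preconditioners from the bullets above on a synthetic batch of per-example gradients. The data, the damping constant, and the eigendecomposition-based matrix powers are assumptions for illustration only; practical optimizers use cheaper factored approximations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-example gradients (rows); in practice these come from a minibatch.
J = rng.normal(size=(256, 10))
g = J.mean(axis=0)              # average gradient
G = (J.T @ J) / len(J)          # empirical Fisher, a common stand-in for the Gauss-Newton matrix

def matrix_power(M, p, eps=1e-6):
    """M^p via eigendecomposition, with damping for near-zero eigenvalues."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag((w + eps) ** p) @ V.T

natural_update   = matrix_power(G, -1.0) @ g   # natural gradient: F^{-1} g
whitening_update = matrix_power(G, -0.5) @ g   # whitening: G^{-1/2} g

# The whitened update rescales each direction by 1/sqrt(eigenvalue) instead of 1/eigenvalue,
# so low-curvature (noisy) directions are amplified less aggressively.
print(np.linalg.norm(natural_update), np.linalg.norm(whitening_update))
```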
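
A third sketch for the spectral-norm bullet: for a matrix-shaped gradient with SVD G = U S V^T, the steepest-descent direction under the spectral norm is U V^T, i.e. the gradient with every singular value set to one. The random gradient below is a stand-in for a real layer's gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 32))   # a matrix-shaped gradient, e.g. for a linear layer

# Project onto the nearest (semi-)orthogonal matrix: keep singular vectors, drop singular values.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
update = U @ Vt

print(np.linalg.svd(update, compute_uv=False)[:5])   # all singular values are 1
print(np.allclose(update.T @ update, np.eye(32)))    # semi-orthogonal: columns are orthonormal
```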
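
Finally, a sketch of the efficiency spectrum named in the optimizer bullet: diagonal whitening (RMSProp/Adam-style) costs O(n) per step, while full-matrix whitening needs O(n^2) memory and a cubic-cost factorization; Shampoo/SOAP, PSGD, and Muon approximate the full-matrix update more cheaply. The correlated synthetic gradients below are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 10, 500
mix = rng.normal(size=(n, n))                 # mixing matrix -> correlated gradient coordinates
grads = rng.normal(size=(steps, n)) @ mix
g, eps = grads[-1], 1e-8

# RMSProp/Adam-style: diagonal second moment, O(n) memory and compute per step.
v = np.mean(grads ** 2, axis=0)
diag_whitened = g / (np.sqrt(v) + eps)

# Full-matrix whitening: G^{-1/2} g, O(n^2) memory and an O(n^3) factorization.
G = grads.T @ grads / steps
w, V = np.linalg.eigh(G)
full_whitened = V @ np.diag(1.0 / np.sqrt(w + eps)) @ V.T @ g

# Shampoo/SOAP, PSGD, and Muon sit between these extremes, using Kronecker-factored
# or iterative (matmul-only) approximations of the same whitening target.
print(np.round(diag_whitened, 2))
print(np.round(full_whitened, 2))
```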