Hasty Briefs

Modern Optimizers – An Alchemist's Notes on Deep Learning

16 days ago
  • #gradient-descent
  • #machine-learning
  • #optimization
  • Optimizers like Adam are the backbone of modern deep learning, but spectral-whitening methods claim to outperform Adam.
  • Gradient descent traditionally uses a Euclidean distance metric, but non-Euclidean metrics can improve optimization by accounting for parameter sensitivities.
  • The whitening metric is derived from the square root of the Gauss-Newton matrix, giving a conservative estimate of local curvature for preconditioning updates (see the first sketch after this list).
  • Natural gradient descent uses the Fisher information matrix, which is related to the whitening metric and ensures parameterization-invariant optimization.
  • Spectral-norm descent relates to the whitening metric by projecting gradients onto the nearest (semi-)orthogonal matrix, i.e., steepest descent measured by the maximum singular value (see the orthogonalization sketch below).
  • Optimizers such as Adam/RMSProp, Shampoo/SOAP/SPlus, PSGD, and Muon implement spectral-whitening updates at varying levels of fidelity and computational cost (Adam/RMSProp use the diagonal approximation sketched below).
  • Benchmarking shows that spectral-whitening optimizers like SOAP and Muon outperform Adam in terms of validation loss and steps-to-Adam ratio.
  • SOAP is the most effective per gradient step, while Muon is more computationally efficient, suggesting a hybrid approach could be optimal.
  • No current method reliably surpasses spectral-whitening optimizers, indicating that further innovation in optimization techniques is still needed.
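
The relationship between the Euclidean, natural-gradient, and whitening updates can be made concrete on a toy problem. Below is a minimal NumPy sketch, not taken from the post: on a quadratic loss the Gauss-Newton/Fisher matrix coincides with the Hessian `H`, so the three updates differ only in the preconditioner applied to the gradient. Variable names are illustrative.

```python
# Toy comparison of three update rules on L(w) = 0.5 * w^T H w
# with an ill-conditioned H. Not the post's code; a sketch only.
import numpy as np

H = np.diag([100.0, 1.0])          # ill-conditioned curvature
w = np.array([1.0, 1.0])

def grad(w):
    return H @ w                   # dL/dw for the quadratic loss

# For this toy loss the Gauss-Newton / Fisher matrix equals H;
# in practice it would be estimated from per-example gradients.
F = H
F_inv = np.linalg.inv(F)

# Matrix inverse square root via eigendecomposition: F^{-1/2}.
evals, evecs = np.linalg.eigh(F)
F_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

lr = 0.01
g = grad(w)

step_euclidean = -lr * g                 # plain gradient descent
step_natural   = -lr * F_inv @ g         # natural gradient: F^{-1} g
step_whitened  = -lr * F_inv_sqrt @ g    # whitening metric: F^{-1/2} g

print("Euclidean:", step_euclidean)
print("Natural:  ", step_natural)
print("Whitened: ", step_whitened)
```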
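
Next, a minimal sketch of the orthogonalized-gradient step behind spectral-norm descent, the idea Muon builds on. The full SVD is shown here for clarity; practical implementations approximate `U @ Vt` with a few Newton-Schulz iterations instead. The code and names are illustrative, not the post's.

```python
# Spectral-norm steepest descent: replace the gradient matrix with its
# projection onto the nearest (semi-)orthogonal matrix. Sketch only.
import numpy as np

def orthogonalize(G):
    """Set all singular values of G to 1, keeping its singular vectors."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # a weight matrix
G = rng.normal(size=(4, 3))        # its (stand-in) gradient

lr = 0.02
W -= lr * orthogonalize(G)         # spectral-norm descent step

# All singular values of the update direction are 1:
print(np.linalg.svd(orthogonalize(G), compute_uv=False))
```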
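
Finally, a minimal sketch of the diagonal approximation used by RMSProp/Adam, which replaces the full inverse square root with an elementwise running second-moment estimate. Hyperparameter values are the usual defaults; the function and loss are illustrative assumptions, not the post's code.

```python
# Adam-style update: diagonal (elementwise) whitening of the gradient.
import numpy as np

def adam_like_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g * g      # diagonal second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # elementwise whitening
    return w, m, v

w = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
for t in range(1, 101):
    g = 2.0 * w + np.array([1.0, -1.0, 0.5])      # gradient of a toy quadratic
    w, m, v = adam_like_step(w, g, m, v, t)
print(w)
```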