Hasty Briefs

How does gradient descent work?

3 days ago
  • #deep learning
  • #gradient descent
  • #optimization
  • The paper introduces a new analysis of gradient descent in deep learning, focusing on 'edge of stability' (EOS) dynamics.
  • Traditional gradient descent analysis fails to capture deep learning dynamics, especially once the sharpness (the largest Hessian eigenvalue) exceeds 2/η, where η is the learning rate (see the first sketch after this list).
  • Gradient descent in deep learning often exits the stable region (sharpness < 2/η) but self-regulates back via oscillations that reduce sharpness.
  • A third-order Taylor expansion reveals that gradient descent has an implicit negative feedback mechanism that regulates sharpness via oscillations.
  • The paper proposes the 'central flow', a differential equation that models the time-averaged trajectory of gradient descent, including at EOS (see the second sketch after this list).
  • The central flow accurately predicts the covariance of the oscillations and matches the long-term path of gradient descent across various architectures.
  • The loss along the central flow is a hidden progress metric, decreasing monotonically, unlike the non-monotonic loss under gradient descent.
  • Empirical results show the central flow's predictions hold across different neural networks and tasks, validating its generality.
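
As a concrete illustration of the 2/η threshold mentioned above, here is a minimal sketch (not from the paper) of gradient descent on a one-dimensional quadratic L(w) = a·w²/2, whose sharpness is simply the constant a: the iterates shrink when a < 2/η and blow up through growing oscillations when a > 2/η. The function name and the specific constants are made up for the example.

```python
def gd_on_quadratic(a, eta, w0=1.0, steps=50):
    """Gradient descent on L(w) = a * w**2 / 2, whose sharpness (second
    derivative) is the constant a. Each step multiplies w by (1 - eta * a)."""
    w = w0
    for _ in range(steps):
        w = w - eta * a * w
    return w

eta = 0.1  # learning rate, so the stability threshold is 2 / eta = 20
for a in (10.0, 19.0, 21.0):  # sharpness well below, just below, and above 2 / eta
    w_final = gd_on_quadratic(a, eta)
    print(f"sharpness {a:4.1f} (2/eta = {2 / eta:.0f}): |w| after 50 steps = {abs(w_final):.2e}")
# a = 10 converges at once, a = 19 flips sign each step but still shrinks,
# a = 21 oscillates with growing amplitude because |1 - eta * a| > 1.
```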
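
The quadratic above has fixed sharpness; the bullets on self-regulation and the central flow are about what happens when the sharpness itself moves. The second sketch below is only a toy illustration of those ideas, not the paper's central-flow equation: it uses a made-up two-parameter loss L(x, y) = ½(1 + y²)x², whose sharpness near the valley floor x = 0 is roughly 1 + y², and tracks a plain exponential moving average of the iterates as a crude stand-in for a time-averaged trajectory. With the learning rate chosen so that 2/η starts out below the sharpness, the x-coordinate oscillates with growing amplitude, the resulting gradient in y (equal to y·x²) drags the sharpness back down, and the loss at the averaged iterate never shows the spike that the raw loss does.

```python
def loss(x, y):
    # Toy valley whose curvature along x (the "sharpness" near x = 0) is 1 + y**2.
    return 0.5 * (1.0 + y**2) * x**2

def grad(x, y):
    return (1.0 + y**2) * x, y * x**2

eta = 0.45         # stability threshold 2/eta ~ 4.44, below the initial sharpness of 5
x, y = 0.05, 2.0   # start slightly off the valley floor, in a sharp region
ax, ay = x, y      # exponential moving average of the iterates ("time-averaged" path)
beta = 0.9

for t in range(40):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y - eta * gy                        # plain gradient descent step
    ax, ay = beta * ax + (1 - beta) * x, beta * ay + (1 - beta) * y
    if t % 5 == 0:
        print(f"t={t:2d}  sharpness~{1 + y**2:5.2f}  "
              f"raw loss={loss(x, y):9.2e}  averaged-iterate loss={loss(ax, ay):9.2e}")
# While the sharpness exceeds 2/eta, the oscillation in x grows and the raw loss
# climbs; the oscillation feeds back through dL/dy = y * x**2, pulling y (and with
# it the sharpness) back below the threshold, after which the oscillation dies out.
# The averaged iterate hugs the valley floor, so its loss stays far below the
# raw-loss spike seen mid-run.
```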