How does gradient descent work?
- #deep learning
- #gradient descent
- #optimization
- The paper introduces a new analysis of gradient descent in deep learning, focusing on the 'edge of stability' (EOS) dynamics.
- Traditional analyses of gradient descent fail to capture deep learning dynamics, especially once the sharpness (the largest eigenvalue of the loss Hessian) exceeds 2/η, where η is the learning rate (see the quadratic sketch after this list).
- Gradient descent in deep learning often exits the stable region (sharpness < 2/η) but self-regulates back: the resulting oscillations reduce the sharpness (the toy simulation after this list reproduces this behavior).
- A third-order Taylor expansion of the loss reveals an implicit negative-feedback mechanism: the larger the oscillations, the harder gradient descent pushes the sharpness down (sketched in the math block after this list).
- The paper proposes the 'central flow', a differential equation that models the time-averaged trajectory of gradient descent, including at the EOS (a rough empirical preview of the time-averaging idea follows the list).
- The central flow accurately predicts the covariance of the oscillations and matches the long-term path of gradient descent across various architectures.
- The loss along the central flow acts as a hidden progress metric: it decreases monotonically, unlike the non-monotonic loss of the raw gradient descent iterates.
- Empirical results show the central flow's predictions hold across different neural networks and tasks, validating its generality.
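A minimal sketch (mine, not from the paper) of where the 2/η threshold comes from: on a one-dimensional quadratic the sharpness is a constant, and gradient descent converges exactly when that constant stays below 2/η.

```python
# Minimal sketch (not from the paper): gradient descent on L(w) = a * w**2 / 2,
# whose sharpness (its only Hessian eigenvalue) is the constant a.
# The update w <- w - eta * a * w = (1 - eta * a) * w converges iff |1 - eta * a| < 1,
# i.e. iff a < 2 / eta, which is the stability threshold quoted above.

def run_gd(a, eta, w0=1.0, steps=20):
    w = w0
    for _ in range(steps):
        w -= eta * a * w           # gradient of a * w**2 / 2 is a * w
    return w

eta = 0.1                          # learning rate, so 2 / eta = 20
print(run_gd(a=15.0, eta=eta))     # sharpness below 2/eta: iterate decays toward 0
print(run_gd(a=25.0, eta=eta))     # sharpness above 2/eta: sign-flipping iterates blow up
```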
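A sketch of the third-order argument behind the negative feedback (notation mine, following standard edge-of-stability reasoning rather than quoting the paper). Let $u$ be the unit top eigenvector of the Hessian at $w$, so the sharpness is $S(w) = u^\top \nabla^2 L(w)\, u$, and consider a displacement $x$ along $u$:

$$
\nabla L(w + x u) \;\approx\; \nabla L(w) + x\, S(w)\, u + \tfrac{x^2}{2}\, \nabla^3 L(w)[u, u, \cdot]
\;=\; \nabla L(w) + x\, S(w)\, u + \tfrac{x^2}{2}\, \nabla S(w),
$$

using $\nabla S(w) = \nabla^3 L(w)[u, u, \cdot]$, which holds when the top eigenvalue is simple. The $x$ term drives the oscillation itself ($x \mapsto (1 - \eta S(w))\, x$, unstable once $S(w) > 2/\eta$); averaged over an oscillation, $x$ roughly cancels while $x^2$ does not, so gradient descent inherits an extra drift of about $-\eta\, \tfrac{\overline{x^2}}{2}\, \nabla S(w)$: the larger the oscillation, the harder the sharpness is pushed down.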
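A runnable toy of the same feedback loop (my example, not the paper's): take $L(w, c) = c\, w^2 / 2$, so the curvature in $w$ is exactly $c$ and the gradient in $c$ is $w^2/2$. While $c > 2/\eta$ the oscillation in $w$ grows, and the growing $w^2$ term drags $c$ (the sharpness) back down.

```python
# Toy model (mine, not the paper's): L(w, c) = c * w**2 / 2.
# The curvature in w equals c, so c plays the role of the sharpness, and
# dL/dc = w**2 / 2 >= 0, so each step lowers c in proportion to the squared
# oscillation amplitude (the negative feedback described above).

eta = 0.1                  # stability threshold for w is c < 2 / eta = 20
w, c = 0.01, 25.0          # start with the sharpness above the threshold

for step in range(41):
    if step % 5 == 0:
        print(f"step {step:2d}  |w| = {abs(w):8.4f}  sharpness c = {c:7.3f}")
    grad_w, grad_c = c * w, 0.5 * w ** 2     # gradients at the current point
    w, c = w - eta * grad_w, c - eta * grad_c

# Typical run: |w| grows while c > 2/eta, the growing w**2 term then drags
# c below 2/eta, and the oscillation decays again.
```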
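The central flow itself is an ODE derived in the paper and is not reproduced here. As a purely empirical preview of the 'time-averaged trajectory' idea, smoothing the iterates of the toy run above with a short running mean (window length is an arbitrary choice) already yields a path far tamer than the raw, oscillating iterates.

```python
import numpy as np

# Empirical preview only (not the paper's central flow, which is a derived ODE):
# time-average the raw iterates of the toy run above and compare amplitudes.

def toy_gd(eta=0.1, w0=0.01, c0=25.0, steps=60):
    w, c = w0, c0
    ws = [w]
    for _ in range(steps):
        w, c = w - eta * c * w, c - eta * 0.5 * w ** 2
        ws.append(w)
    return np.array(ws)

ws = toy_gd()
window = np.ones(10) / 10                       # 10-step running mean (arbitrary)
w_avg = np.convolve(ws, window, mode="valid")

print("max |raw iterate|      :", np.abs(ws).max())     # large: raw iterates oscillate
print("max |averaged iterate| :", np.abs(w_avg).max())  # much smaller: averaged path is smooth
```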