How does gradient descent work?
- #deep learning
- #gradient descent
- #optimization
- The paper introduces a new analysis of gradient descent in deep learning, focusing on the 'edge of stability' (EOS) dynamics.
- Traditional analyses of gradient descent fail to capture deep learning dynamics, especially once the sharpness (the largest eigenvalue of the loss Hessian) exceeds 2/η, where η is the learning rate (see the quadratic sketch after this list).
- Gradient descent in deep learning often exits the stable region (sharpness < 2/η) but self-regulates back: the resulting oscillations reduce the sharpness (the toy simulation after this list reproduces this behavior).
- A third-order Taylor expansion of the loss reveals an implicit negative-feedback mechanism: the larger the oscillations, the harder gradient descent pushes the sharpness down (sketched in the math block after this list).
- The paper proposes the 'central flow', a differential equation that models the time-averaged trajectory of gradient descent, including at the EOS (a rough empirical preview of the time-averaging idea follows the list).
- The central flow accurately predicts the covariance of the oscillations and matches the long-term path of gradient descent across various architectures.
- The loss along the central flow acts as a hidden progress metric: it decreases monotonically, unlike the non-monotonic loss of the raw gradient descent iterates.
- Empirical results show the central flow's predictions hold across different neural networks and tasks, validating its generality.
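A minimal sketch (mine, not from the paper) of where the 2/η threshold comes from: on a one-dimensional quadratic the sharpness is a constant, and gradient descent converges exactly when that constant stays below 2/η.

```python
# Minimal sketch (not from the paper): gradient descent on L(w) = a * w**2 / 2,
# whose sharpness (its only Hessian eigenvalue) is the constant a.
# The update w <- w - eta * a * w = (1 - eta * a) * w converges iff |1 - eta * a| < 1,
# i.e. iff a < 2 / eta, which is the stability threshold quoted above.

def run_gd(a, eta, w0=1.0, steps=20):
    w = w0
    for _ in range(steps):
        w -= eta * a * w           # gradient of a * w**2 / 2 is a * w
    return w

eta = 0.1                          # learning rate, so 2 / eta = 20
print(run_gd(a=15.0, eta=eta))     # sharpness below 2/eta: iterate decays toward 0
print(run_gd(a=25.0, eta=eta))     # sharpness above 2/eta: sign-flipping iterates blow up
```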
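A sketch of the third-order argument behind the negative feedback (notation mine, following standard edge-of-stability reasoning rather than quoting the paper). Let $u$ be the unit top eigenvector of the Hessian at $w$, so the sharpness is $S(w) = u^\top \nabla^2 L(w)\, u$, and consider a displacement $x$ along $u$:

$$
\nabla L(w + x u) \;\approx\; \nabla L(w) + x\, S(w)\, u + \tfrac{x^2}{2}\, \nabla^3 L(w)[u, u, \cdot]
\;=\; \nabla L(w) + x\, S(w)\, u + \tfrac{x^2}{2}\, \nabla S(w),
$$

using $\nabla S(w) = \nabla^3 L(w)[u, u, \cdot]$, which holds when the top eigenvalue is simple. The $x$ term drives the oscillation itself ($x \mapsto (1 - \eta S(w))\, x$, unstable once $S(w) > 2/\eta$); averaged over an oscillation, $x$ roughly cancels while $x^2$ does not, so gradient descent inherits an extra drift of about $-\eta\, \tfrac{\overline{x^2}}{2}\, \nabla S(w)$: the larger the oscillation, the harder the sharpness is pushed down.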
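A runnable toy of the same feedback loop (my example, not the paper's): take $L(w, c) = c\, w^2 / 2$, so the curvature in $w$ is exactly $c$ and the gradient in $c$ is $w^2/2$. While $c > 2/\eta$ the oscillation in $w$ grows, and the growing $w^2$ term drags $c$ (the sharpness) back down.

```python
# Toy model (mine, not the paper's): L(w, c) = c * w**2 / 2.
# The curvature in w equals c, so c plays the role of the sharpness, and
# dL/dc = w**2 / 2 >= 0, so each step lowers c in proportion to the squared
# oscillation amplitude (the negative feedback described above).

eta = 0.1                  # stability threshold for w is c < 2 / eta = 20
w, c = 0.01, 25.0          # start with the sharpness above the threshold

for step in range(41):
    if step % 5 == 0:
        print(f"step {step:2d}  |w| = {abs(w):8.4f}  sharpness c = {c:7.3f}")
    grad_w, grad_c = c * w, 0.5 * w ** 2     # gradients at the current point
    w, c = w - eta * grad_w, c - eta * grad_c

# Typical run: |w| grows while c > 2/eta, the growing w**2 term then drags
# c below 2/eta, and the oscillation decays again.
```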
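The central flow itself is an ODE derived in the paper and is not reproduced here. As a purely empirical preview of the 'time-averaged trajectory' idea, smoothing the iterates of the toy run above with a short running mean (window length is an arbitrary choice) already yields a path far tamer than the raw, oscillating iterates.

```python
import numpy as np

# Empirical preview only (not the paper's central flow, which is a derived ODE):
# time-average the raw iterates of the toy run above and compare amplitudes.

def toy_gd(eta=0.1, w0=0.01, c0=25.0, steps=60):
    w, c = w0, c0
    ws = [w]
    for _ in range(steps):
        w, c = w - eta * c * w, c - eta * 0.5 * w ** 2
        ws.append(w)
    return np.array(ws)

ws = toy_gd()
window = np.ones(10) / 10                       # 10-step running mean (arbitrary)
w_avg = np.convolve(ws, window, mode="valid")

print("max |raw iterate|      :", np.abs(ws).max())     # large: raw iterates oscillate
print("max |averaged iterate| :", np.abs(w_avg).max())  # much smaller: averaged path is smooth
```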