Why Momentum Really Works (Distill, 2017)
- #momentum
- #gradient-descent
- #optimization
- Gradient descent is an iterative optimization method, pictured as a hiker walking down a hill: each step moves against the gradient, w_{k+1} = w_k − α∇f(w_k), following the steepest path downward with slow but steady progress.
- Momentum modifies gradient descent the way a heavy ball rolling down the same hill differs from the hiker: the updates accumulate a velocity, z_{k+1} = βz_k + ∇f(w_k) and w_{k+1} = w_k − αz_{k+1}, which smooths out oscillations and helps carry the iterates past small humps and shallow local minima. Both update rules are sketched in the first code block after these notes.
- The heavy-ball story, however, is shallow as an explanation: it says nothing about why the acceleration occurs, how to choose β, or when momentum fails, so a more precise model is needed to understand its dynamics.
- The convex quadratic, f(w) = ½wᵀAw − bᵀw with A symmetric and positive definite, is proposed as the right balance between simplicity and richness: rotating into the eigenbasis of A decouples the iterates coordinate by coordinate, giving a closed-form picture of momentum's local dynamics (second sketch below).
- Gradient descent's limitations fall out of this analysis: the i-th error component contracts by a factor of 1 − αλ_i per step, so the step size is capped by the largest eigenvalue while overall convergence is throttled by the smallest. Pathological curvature — long, narrow valleys where the condition number κ = λ_max/λ_min is large — therefore forces painfully slow progress along the shallow directions.
- Momentum fixes this by adding a short-term memory to the updates: the accumulated velocity z lets the iterates build up speed along directions of persistent descent, yielding acceleration and improved convergence.
- The effectiveness is quantifiable: on a quadratic with condition number κ, optimally tuned gradient descent converges at rate (κ−1)/(κ+1), while optimally tuned momentum achieves (√κ−1)/(√κ+1) — a quadratic speedup — and by Nesterov's lower bound this is optimal, in a technical sense, among first-order methods (third sketch below).
- The analysis extends to polynomial regression: the eigenvectors of the data matrix act as "eigenfeatures", and gradient descent fits the large-eigenvalue (smooth, robust) eigenfeatures quickly while the small-eigenvalue (wiggly, noise-sensitive) ones converge last (fourth sketch below).
- Early stopping exploits exactly this ordering: halting the optimization before the noise-carrying components have converged prevents overfitting, acting much like explicit regularization.
- The limits of first-order methods are made concrete by the "worst function in the world": a convex quadratic whose gradient couples only neighboring coordinates, so any method that stays in the span of past gradients can make at most one new coordinate nonzero per iteration — no first-order method can converge faster than momentum's rate on it (final sketch below).
- The article concludes that the heavy-ball view is only one of several interpretations of momentum still being explored, and that acceleration remains fertile ground for further advances in optimization.
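A minimal sketch of the two update rules above, in the form the article writes them; the quadratic A, b, the step size α, and the momentum β below are illustrative choices of mine, not values from the article.

```python
import numpy as np

# Toy convex quadratic f(w) = 1/2 w^T A w - b^T w (illustrative A and b).
A = np.array([[1.0, 0.0],
              [0.0, 25.0]])          # eigenvalues 1 and 25 -> condition number 25
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b
w_star = np.linalg.solve(A, b)       # exact minimizer, for measuring error

def gradient_descent(alpha, steps=100):
    w = np.zeros(2)
    for _ in range(steps):
        w = w - alpha * grad(w)      # w_{k+1} = w_k - alpha * grad f(w_k)
    return w

def momentum(alpha, beta, steps=100):
    w, z = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        z = beta * z + grad(w)       # z_{k+1} = beta * z_k + grad f(w_k)
        w = w - alpha * z            # w_{k+1} = w_k - alpha * z_{k+1}
    return w

print(np.linalg.norm(gradient_descent(alpha=0.07) - w_star))    # plain GD
print(np.linalg.norm(momentum(alpha=0.07, beta=0.6) - w_star))  # orders of magnitude closer
```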
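Why pathological curvature hurts, as described above — a sketch assuming the quadratic is already diagonal (i.e., we are working in the eigenbasis of A), so each error component visibly contracts by its own factor 1 − αλ_i:

```python
import numpy as np

# Ill-conditioned quadratic: steep in one direction, nearly flat in the other.
A = np.diag([100.0, 1.0])            # lambda_max = 100, lambda_min = 1, kappa = 100
w = np.array([1.0, 1.0])             # start equally far along both eigenvectors

alpha = 1.0 / 100.0                  # step size is capped by the steep direction
for _ in range(50):
    w = w - alpha * (A @ w)          # gradient of 1/2 w^T A w is A w

# Steep component: contraction factor 1 - alpha*100 = 0, gone after one step.
# Shallow component: factor 1 - alpha*1 = 0.99, barely moved after 50 steps.
print(w)                             # ~[0.0, 0.605]
```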
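The optimal convergence rates quoted above, compared numerically; the two rate formulas are from the article, and the step counts are just log(tolerance)/log(rate):

```python
import math

def gd_rate(kappa):        # best achievable rate for plain gradient descent
    return (kappa - 1) / (kappa + 1)

def momentum_rate(kappa):  # best achievable rate with tuned momentum
    s = math.sqrt(kappa)
    return (s - 1) / (s + 1)

for kappa in (10, 100, 10_000):
    # iterations needed to shrink the error by a factor of 1e6
    k_gd  = math.log(1e-6) / math.log(gd_rate(kappa))
    k_mom = math.log(1e-6) / math.log(momentum_rate(kappa))
    print(f"kappa={kappa:>6}: GD ~{k_gd:8.0f} steps, momentum ~{k_mom:6.0f} steps")
```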
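Eigenfeatures and early stopping, sketched on a small polynomial regression. The target function, noise level, degree, and iteration count are my illustrative choices; the point is that an early-stopped iterate has not yet fit the tiny-eigenvalue, noise-carrying directions, so it usually tracks the clean signal better than the fully converged least-squares fit (exact numbers depend on the noise draw):

```python
import numpy as np

rng = np.random.default_rng(0)
n, degree = 30, 12
x = np.linspace(-1, 1, n)
y_true = np.sin(np.pi * x)                       # smooth underlying signal
y = y_true + 0.3 * rng.standard_normal(n)        # noisy observations

Phi = np.vander(x, degree + 1, increasing=True)  # polynomial features
alpha = 1.0 / np.linalg.norm(Phi.T @ Phi, 2)     # stable step for the top eigenvalue

# Early-stopped gradient descent: large-eigenvalue eigenfeatures converge fast,
# tiny-eigenvalue (wiggly, noise-fitting) directions have barely moved.
w = np.zeros(degree + 1)
for _ in range(20_000):
    w -= alpha * Phi.T @ (Phi @ w - y)

# Fully converged least-squares fit, which also fits the noise.
w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]

mse_vs_truth = lambda v: np.mean((Phi @ v - y_true) ** 2)
print(f"early-stopped GD vs clean signal: {mse_vs_truth(w):.4f}")
print(f"exact least squares vs clean signal: {mse_vs_truth(w_ls):.4f}")
```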
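Finally, the "worst function in the world". This sketch uses the tridiagonal (path-graph Laplacian) quadratic at the heart of Nesterov's construction — the article's exact scaling differs, but the structural point survives: starting from zero, the gradient only couples neighbors, so information spreads one coordinate per iteration, and after k steps at most k coordinates can be nonzero, no matter how the step sizes are chosen:

```python
import numpy as np

n = 25
# Tridiagonal quadratic: f(x) = 1/2 x^T A x - e1^T x; the gradient couples neighbors only.
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n); e1[0] = 1.0
grad = lambda x: A @ x - e1

x = np.zeros(n)                      # a first-order method starts with no information
alpha = 0.25                         # illustrative step size
for k in range(1, 6):
    x = x - alpha * grad(x)
    # the support of x grows by at most one coordinate per iteration
    print(k, np.count_nonzero(x))    # prints 1, 2, 3, 4, 5
```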