Why Momentum Really Works (Distill, 2017)
- #momentum
- #gradient-descent
- #optimization
- Gradient descent is an iterative optimization method, pictured as a hiker walking down a hill: each step moves against the gradient, w_{k+1} = w_k − α∇f(w_k), following the steepest path downward with slow but steady progress.
- Momentum modifies gradient descent the way a heavy ball rolling down the same hill differs from the hiker: the updates accumulate a velocity, z_{k+1} = βz_k + ∇f(w_k) and w_{k+1} = w_k − αz_{k+1}, which smooths out oscillations and helps carry the iterates past small humps and shallow local minima. Both update rules are sketched in the first code block after these notes.
- The heavy-ball story, however, is shallow as an explanation: it says nothing about why the acceleration occurs, how to choose β, or when momentum fails, so a more precise model is needed to understand its dynamics.
- The convex quadratic, f(w) = ½wᵀAw − bᵀw with A symmetric and positive definite, is proposed as the right balance between simplicity and richness: rotating into the eigenbasis of A decouples the iterates coordinate by coordinate, giving a closed-form picture of momentum's local dynamics (second sketch below).
- Gradient descent's limitations fall out of this analysis: the i-th error component contracts by a factor of 1 − αλ_i per step, so the step size is capped by the largest eigenvalue while overall convergence is throttled by the smallest. Pathological curvature — long, narrow valleys where the condition number κ = λ_max/λ_min is large — therefore forces painfully slow progress along the shallow directions.
- Momentum fixes this by adding a short-term memory to the updates: the accumulated velocity z lets the iterates build up speed along directions of persistent descent, yielding acceleration and improved convergence.
- The effectiveness is quantifiable: on a quadratic with condition number κ, optimally tuned gradient descent converges at rate (κ−1)/(κ+1), while optimally tuned momentum achieves (√κ−1)/(√κ+1) — a quadratic speedup — and by Nesterov's lower bound this is optimal, in a technical sense, among first-order methods (third sketch below).
- The analysis extends to polynomial regression: the eigenvectors of the data matrix act as "eigenfeatures", and gradient descent fits the large-eigenvalue (smooth, robust) eigenfeatures quickly while the small-eigenvalue (wiggly, noise-sensitive) ones converge last (fourth sketch below).
- Early stopping exploits exactly this ordering: halting the optimization before the noise-carrying components have converged prevents overfitting, acting much like explicit regularization.
- The limits of first-order methods are made concrete by the "worst function in the world": a convex quadratic whose gradient couples only neighboring coordinates, so any method that stays in the span of past gradients can make at most one new coordinate nonzero per iteration — no first-order method can converge faster than momentum's rate on it (final sketch below).
- The article concludes that the heavy-ball view is only one of several interpretations of momentum still being explored, and that acceleration remains fertile ground for further advances in optimization.
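A minimal sketch of the two update rules above, in the form the article writes them; the quadratic A, b, the step size α, and the momentum β below are illustrative choices of mine, not values from the article.

```python
import numpy as np

# Toy convex quadratic f(w) = 1/2 w^T A w - b^T w (illustrative A and b).
A = np.array([[1.0, 0.0],
              [0.0, 25.0]])          # eigenvalues 1 and 25 -> condition number 25
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b
w_star = np.linalg.solve(A, b)       # exact minimizer, for measuring error

def gradient_descent(alpha, steps=100):
    w = np.zeros(2)
    for _ in range(steps):
        w = w - alpha * grad(w)      # w_{k+1} = w_k - alpha * grad f(w_k)
    return w

def momentum(alpha, beta, steps=100):
    w, z = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        z = beta * z + grad(w)       # z_{k+1} = beta * z_k + grad f(w_k)
        w = w - alpha * z            # w_{k+1} = w_k - alpha * z_{k+1}
    return w

print(np.linalg.norm(gradient_descent(alpha=0.07) - w_star))    # plain GD
print(np.linalg.norm(momentum(alpha=0.07, beta=0.6) - w_star))  # orders of magnitude closer
```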
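Why pathological curvature hurts, as described above — a sketch assuming the quadratic is already diagonal (i.e., we are working in the eigenbasis of A), so each error component visibly contracts by its own factor 1 − αλ_i:

```python
import numpy as np

# Ill-conditioned quadratic: steep in one direction, nearly flat in the other.
A = np.diag([100.0, 1.0])            # lambda_max = 100, lambda_min = 1, kappa = 100
w = np.array([1.0, 1.0])             # start equally far along both eigenvectors

alpha = 1.0 / 100.0                  # step size is capped by the steep direction
for _ in range(50):
    w = w - alpha * (A @ w)          # gradient of 1/2 w^T A w is A w

# Steep component: contraction factor 1 - alpha*100 = 0, gone after one step.
# Shallow component: factor 1 - alpha*1 = 0.99, barely moved after 50 steps.
print(w)                             # ~[0.0, 0.605]
```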
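The optimal convergence rates quoted above, compared numerically; the two rate formulas are from the article, and the step counts are just log(tolerance)/log(rate):

```python
import math

def gd_rate(kappa):        # best achievable rate for plain gradient descent
    return (kappa - 1) / (kappa + 1)

def momentum_rate(kappa):  # best achievable rate with tuned momentum
    s = math.sqrt(kappa)
    return (s - 1) / (s + 1)

for kappa in (10, 100, 10_000):
    # iterations needed to shrink the error by a factor of 1e6
    k_gd  = math.log(1e-6) / math.log(gd_rate(kappa))
    k_mom = math.log(1e-6) / math.log(momentum_rate(kappa))
    print(f"kappa={kappa:>6}: GD ~{k_gd:8.0f} steps, momentum ~{k_mom:6.0f} steps")
```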
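Eigenfeatures and early stopping, sketched on a small polynomial regression. The target function, noise level, degree, and iteration count are my illustrative choices; the point is that an early-stopped iterate has not yet fit the tiny-eigenvalue, noise-carrying directions, so it usually tracks the clean signal better than the fully converged least-squares fit (exact numbers depend on the noise draw):

```python
import numpy as np

rng = np.random.default_rng(0)
n, degree = 30, 12
x = np.linspace(-1, 1, n)
y_true = np.sin(np.pi * x)                       # smooth underlying signal
y = y_true + 0.3 * rng.standard_normal(n)        # noisy observations

Phi = np.vander(x, degree + 1, increasing=True)  # polynomial features
alpha = 1.0 / np.linalg.norm(Phi.T @ Phi, 2)     # stable step for the top eigenvalue

# Early-stopped gradient descent: large-eigenvalue eigenfeatures converge fast,
# tiny-eigenvalue (wiggly, noise-fitting) directions have barely moved.
w = np.zeros(degree + 1)
for _ in range(20_000):
    w -= alpha * Phi.T @ (Phi @ w - y)

# Fully converged least-squares fit, which also fits the noise.
w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]

mse_vs_truth = lambda v: np.mean((Phi @ v - y_true) ** 2)
print(f"early-stopped GD vs clean signal: {mse_vs_truth(w):.4f}")
print(f"exact least squares vs clean signal: {mse_vs_truth(w_ls):.4f}")
```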
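Finally, the "worst function in the world". This sketch uses the tridiagonal (path-graph Laplacian) quadratic at the heart of Nesterov's construction — the article's exact scaling differs, but the structural point survives: starting from zero, the gradient only couples neighbors, so information spreads one coordinate per iteration, and after k steps at most k coordinates can be nonzero, no matter how the step sizes are chosen:

```python
import numpy as np

n = 25
# Tridiagonal quadratic: f(x) = 1/2 x^T A x - e1^T x; the gradient couples neighbors only.
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n); e1[0] = 1.0
grad = lambda x: A @ x - e1

x = np.zeros(n)                      # a first-order method starts with no information
alpha = 0.25                         # illustrative step size
for k in range(1, 6):
    x = x - alpha * grad(x)
    # the support of x grows by at most one coordinate per iteration
    print(k, np.count_nonzero(x))    # prints 1, 2, 3, 4, 5
```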