Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
- #Feature Learning
- #Machine Learning
- #Grokking
- The paper introduces a novel framework, $\mathbf{Li_2}$, to study grokking behavior in 2-layer nonlinear networks.
- It identifies three key stages of grokking: Lazy learning, Independent feature learning, and Interactive feature learning.
- In the lazy-learning stage, the top layer overfits to the random hidden representations, producing memorization rather than generalization (a toy sketch follows this list).
- The backpropagated gradient $G_F$ carries label information, which enables each hidden node to learn features independently.
- These independent dynamics follow gradient ascent on an energy function $E$, whose local maxima correspond to the emerging features (see the second sketch below).
- The study examines how generalizability, representation power, and sample size shape feature emergence.
- In the interactive feature learning stage, $G_F$ refocuses on features that are still missing, driving the hidden nodes to pick them up.
- The analysis reveals the roles of hyperparameters such as weight decay, learning rate, and sample size in grokking.
- The framework yields provable scaling laws for feature emergence, memorization, and generalization.
- It also explains the effectiveness of optimizers such as Muon from the principles of the gradient dynamics (a simplified sketch of a Muon-style update appears below).
- The framework extends to multi-layer architectures.
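
A minimal sketch (not from the paper) of the lazy-learning stage: the hidden layer stays at its random initialization and only the top layer is fit, which suffices to memorize the training labels while generalizing poorly. The task, sizes, and ridge-regression readout below are all illustrative assumptions rather than the paper's setup.

```python
# Toy illustration of the lazy-learning stage: the hidden layer W is frozen at
# random init and only the top-layer weights are fit (random-feature regression).
import numpy as np

rng = np.random.default_rng(0)
d, m, n_train, n_test = 32, 2048, 200, 2000

# Target rule: an XOR-like label y = sign(x_0 * x_1), which random features can
# interpolate on the training set but struggle to generalize from few samples.
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = np.sign(X_train[:, 0] * X_train[:, 1])
y_test = np.sign(X_test[:, 0] * X_test[:, 1])

W = rng.standard_normal((d, m)) / np.sqrt(d)          # frozen random hidden layer
F_train = np.maximum(X_train @ W, 0.0)                # ReLU random features
F_test = np.maximum(X_test @ W, 0.0)

# Fit only the top layer by ridge regression (the "lazy" regime).
a = np.linalg.solve(F_train.T @ F_train + 1e-3 * np.eye(m), F_train.T @ y_train)

print("train acc:", np.mean(np.sign(F_train @ a) == y_train))  # typically ~1.0 (memorized)
print("test  acc:", np.mean(np.sign(F_test @ a) == y_test))    # typically near chance
```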
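
A toy sketch of the independent feature-learning stage as gradient ascent on an energy function. The paper's exact $E$ depends on its setup; here an illustrative stand-in $E(w) = w^\top G G^\top w$ over the unit sphere is assumed, whose local maxima are top eigenvectors of $G G^\top$, so the ascent visibly converges to one of them.

```python
# Toy sketch: one hidden node's weight vector w performs (projected) gradient
# ascent on an energy E and settles at a local maximum.
# The quadratic E(w) = w^T (G G^T) w is an illustrative stand-in, not the paper's
# definition; its maxima on the unit sphere are top eigenvectors of G G^T.
import numpy as np

rng = np.random.default_rng(1)
d = 16
G = rng.standard_normal((d, 8))        # stand-in for the backpropagated gradient G_F
M = G @ G.T

w = rng.standard_normal(d)
w /= np.linalg.norm(w)

eta = 0.05
for _ in range(500):
    w = w + eta * (2.0 * M @ w)        # ascent step, grad E(w) = 2 M w
    w /= np.linalg.norm(w)             # project back onto the unit sphere

print("E(w) after ascent  :", w @ M @ w)
print("top eigenvalue of M:", np.linalg.eigvalsh(M)[-1])   # the two should match
```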
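
For context on the Muon remark, here is a simplified sketch of a Muon-style update, i.e. the gradient's momentum orthogonalized by a Newton-Schulz iteration before being applied. Real Muon uses a tuned quintic iteration and additional scaling; the plain cubic iteration and the hyperparameters below are assumptions for illustration only.

```python
# Simplified sketch of a Muon-style update for a 2D weight matrix:
# momentum of the gradient, orthogonalized via Newton-Schulz, then applied.
import numpy as np

def newton_schulz_orthogonalize(M, steps=20):
    """Approximate the orthogonal polar factor U V^T of M = U S V^T."""
    X = M / (np.linalg.norm(M) + 1e-12)     # scale so all singular values are < 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X     # cubic Newton-Schulz step: pushes singular values toward 1
    return X

def muon_like_step(W, grad, momentum, beta=0.95, lr=0.02):
    """One illustrative update of W from its gradient and momentum buffer."""
    momentum = beta * momentum + grad
    W = W - lr * newton_schulz_orthogonalize(momentum)
    return W, momentum

# Usage: the update direction has (approximately) all singular values equal to 1,
# so every direction in the weight matrix moves at a uniform scale.
rng = np.random.default_rng(2)
W = rng.standard_normal((8, 8))
grad = rng.standard_normal((8, 8))
W, momentum = muon_like_step(W, grad, np.zeros((8, 8)))
update_dir = newton_schulz_orthogonalize(momentum)
print("singular values of update direction:", np.round(np.linalg.svd(update_dir)[1], 3))
```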