Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
- #Feature Learning
- #Machine Learning
- #Grokking
- The paper introduces a novel framework, $\mathbf{Li_2}$, to study grokking behavior in 2-layer nonlinear networks.
- It identifies three key stages of grokking: Lazy learning, Independent feature learning, and Interactive feature learning.
- In the lazy-learning stage, the top layer overfits to the random hidden representations, producing memorization rather than generalization (a toy sketch follows this list).
- The backpropagated gradient $G_F$ carries label information, which enables each hidden node to learn features independently.
- These independent dynamics follow gradient ascent on an energy function $E$, whose local maxima correspond to the emerging features (see the second sketch below).
- The study examines how generalizability, representation power, and sample size shape feature emergence.
- In the interactive feature learning stage, $G_F$ refocuses on features that are still missing, driving the hidden nodes to pick them up.
- The analysis reveals the roles of hyperparameters such as weight decay, learning rate, and sample size in grokking.
- The framework yields provable scaling laws for feature emergence, memorization, and generalization.
- It also explains the effectiveness of optimizers such as Muon from the principles of the gradient dynamics (a simplified sketch of a Muon-style update appears below).
- The framework extends to multi-layer architectures.
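
A minimal sketch (not from the paper) of the lazy-learning stage: the hidden layer stays at its random initialization and only the top layer is fit, which suffices to memorize the training labels while generalizing poorly. The task, sizes, and ridge-regression readout below are all illustrative assumptions rather than the paper's setup.

```python
# Toy illustration of the lazy-learning stage: the hidden layer W is frozen at
# random init and only the top-layer weights are fit (random-feature regression).
import numpy as np

rng = np.random.default_rng(0)
d, m, n_train, n_test = 32, 2048, 200, 2000

# Target rule: an XOR-like label y = sign(x_0 * x_1), which random features can
# interpolate on the training set but struggle to generalize from few samples.
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = np.sign(X_train[:, 0] * X_train[:, 1])
y_test = np.sign(X_test[:, 0] * X_test[:, 1])

W = rng.standard_normal((d, m)) / np.sqrt(d)          # frozen random hidden layer
F_train = np.maximum(X_train @ W, 0.0)                # ReLU random features
F_test = np.maximum(X_test @ W, 0.0)

# Fit only the top layer by ridge regression (the "lazy" regime).
a = np.linalg.solve(F_train.T @ F_train + 1e-3 * np.eye(m), F_train.T @ y_train)

print("train acc:", np.mean(np.sign(F_train @ a) == y_train))  # typically ~1.0 (memorized)
print("test  acc:", np.mean(np.sign(F_test @ a) == y_test))    # typically near chance
```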
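
A toy sketch of the independent feature-learning stage as gradient ascent on an energy function. The paper's exact $E$ depends on its setup; here an illustrative stand-in $E(w) = w^\top G G^\top w$ over the unit sphere is assumed, whose local maxima are top eigenvectors of $G G^\top$, so the ascent visibly converges to one of them.

```python
# Toy sketch: one hidden node's weight vector w performs (projected) gradient
# ascent on an energy E and settles at a local maximum.
# The quadratic E(w) = w^T (G G^T) w is an illustrative stand-in, not the paper's
# definition; its maxima on the unit sphere are top eigenvectors of G G^T.
import numpy as np

rng = np.random.default_rng(1)
d = 16
G = rng.standard_normal((d, 8))        # stand-in for the backpropagated gradient G_F
M = G @ G.T

w = rng.standard_normal(d)
w /= np.linalg.norm(w)

eta = 0.05
for _ in range(500):
    w = w + eta * (2.0 * M @ w)        # ascent step, grad E(w) = 2 M w
    w /= np.linalg.norm(w)             # project back onto the unit sphere

print("E(w) after ascent  :", w @ M @ w)
print("top eigenvalue of M:", np.linalg.eigvalsh(M)[-1])   # the two should match
```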
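
For context on the Muon remark, here is a simplified sketch of a Muon-style update, i.e. the gradient's momentum orthogonalized by a Newton-Schulz iteration before being applied. Real Muon uses a tuned quintic iteration and additional scaling; the plain cubic iteration and the hyperparameters below are assumptions for illustration only.

```python
# Simplified sketch of a Muon-style update for a 2D weight matrix:
# momentum of the gradient, orthogonalized via Newton-Schulz, then applied.
import numpy as np

def newton_schulz_orthogonalize(M, steps=20):
    """Approximate the orthogonal polar factor U V^T of M = U S V^T."""
    X = M / (np.linalg.norm(M) + 1e-12)     # scale so all singular values are < 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X     # cubic Newton-Schulz step: pushes singular values toward 1
    return X

def muon_like_step(W, grad, momentum, beta=0.95, lr=0.02):
    """One illustrative update of W from its gradient and momentum buffer."""
    momentum = beta * momentum + grad
    W = W - lr * newton_schulz_orthogonalize(momentum)
    return W, momentum

# Usage: the update direction has (approximately) all singular values equal to 1,
# so every direction in the weight matrix moves at a uniform scale.
rng = np.random.default_rng(2)
W = rng.standard_normal((8, 8))
grad = rng.standard_normal((8, 8))
W, momentum = muon_like_step(W, grad, np.zeros((8, 8)))
update_dir = newton_schulz_orthogonalize(momentum)
print("singular values of update direction:", np.round(np.linalg.svd(update_dir)[1], 3))
```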