Hasty Briefs

Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

  • #Feature Learning
  • #Machine Learning
  • #Grokking
  • The paper introduces a novel framework, $\mathbf{Li_2}$, to study grokking behavior in 2-layer nonlinear networks.
  • It identifies three key stages of grokking: Lazy learning, Independent feature learning, and Interactive feature learning.
  • In the lazy learning stage, the top layer overfits to essentially random hidden representations, yielding memorization rather than generalization (see the first sketch after this list).
  • The backpropagated gradient $G_F$ carries label information, which enables each hidden node to learn features independently.
  • In this independent stage, a node's dynamics follow gradient ascent on an energy function $E$, whose local maxima are the emerging features (see the schematic update after this list).
  • The study examines the generalizability and representation power of the emerged features, and how sample size affects their emergence.
  • In the interactive feature learning stage, $G_F$ concentrates on features that are still missing, steering hidden nodes toward them.
  • The analysis clarifies the roles of hyperparameters such as weight decay, learning rate, and sample size in grokking.
  • The framework yields provable scaling laws for feature emergence, memorization, and generalization.
  • It explains the effectiveness of optimizers such as Muon from these gradient-dynamics principles (see the Muon-style sketch after this list).
  • The framework extends to multi-layer architectures.
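
To make the lazy learning stage concrete, here is a minimal sketch (not the paper's code; the parity-style target, layer widths, and the ridge fit for the top layer are illustrative assumptions). The hidden layer is frozen at random initialization and only the top layer is trained, which typically pushes training accuracy to 1.0 while test accuracy stays near chance, i.e., memorization without generalization.

```python
# Minimal sketch of lazy-regime memorization in a 2-layer net (illustrative, not the paper's code).
# The hidden layer W1 is frozen at random initialization; only the top layer W2 is fit,
# so any fit it achieves rides entirely on random hidden representations.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, m = 64, 512, 32, 1024         # few samples, wide random hidden layer

X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = np.sign(X_train[:, 0] * X_train[:, 1])  # parity-like target, hard for random features
y_test = np.sign(X_test[:, 0] * X_test[:, 1])

W1 = rng.standard_normal((d, m)) / np.sqrt(d)     # frozen random hidden weights
H_train = np.maximum(X_train @ W1, 0.0)           # ReLU hidden representations
H_test = np.maximum(X_test @ W1, 0.0)

# Fit only the top layer; ridge regression stands in for training W2 with weight decay.
lam = 1e-3
W2 = np.linalg.solve(H_train.T @ H_train + lam * np.eye(m), H_train.T @ y_train)

train_acc = np.mean(np.sign(H_train @ W2) == y_train)
test_acc = np.mean(np.sign(H_test @ W2) == y_test)
print(f"train acc = {train_acc:.2f}, test acc = {test_acc:.2f}")  # memorization >> generalization
```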
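
For the independent feature learning stage, one schematic way to read the gradient-ascent statement (an assumed form; the paper's precise definition of $E$ may differ) is that, while the backpropagated signal is approximately fixed, each hidden weight $w_k$ evolves as

$$
\dot{w}_k \;\propto\; \nabla_{w_k} E(w_k),
\qquad
E(w_k) \;=\; \mathbb{E}_{x}\big[\, g_k(x)\,\sigma(w_k^\top x) \,\big],
$$

where $\sigma$ is the nonlinearity and $g_k(x)$ is the component of $G_F$ reaching node $k$. The weight climbs this energy landscape (up to weight decay and normalization terms) and settles at a local maximum, which is the emerged feature.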
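
On the optimizer point, below is a minimal sketch of a Muon-style update, assuming the usual recipe of orthogonalizing the momentum of each weight matrix with a Newton-Schulz iteration; function names, coefficients, and hyperparameters here are illustrative rather than the official implementation.

```python
# Minimal sketch of a Muon-style update (illustrative; not the official Muon code).
# The momentum of a 2D weight matrix is approximately orthogonalized with a
# Newton-Schulz iteration, so every singular direction of the gradient signal
# receives a comparable step size.
import numpy as np

def newton_schulz_orthogonalize(M, steps=5):
    """Approximately set M's singular values to 1 while keeping its singular vectors."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients (assumed)
    X = M / (np.linalg.norm(M) + 1e-7)       # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_style_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One update: accumulate momentum, then take an orthogonalized step."""
    momentum = beta * momentum + grad
    W = W - lr * newton_schulz_orthogonalize(momentum)
    return W, momentum

# Hypothetical usage on a random weight matrix and gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
momentum = np.zeros_like(W)
grad = rng.standard_normal(W.shape)
W, momentum = muon_style_step(W, grad, momentum)
```

Read through the paper's lens, equalizing step sizes across gradient directions plausibly helps weaker, not-yet-emerged feature directions in $G_F$ keep moving instead of being drowned out by the dominant ones.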