A Theory of Deep Learning
- #Deep Learning Theory
- #Generalization
- #Statistical Learning
- Deep learning challenges classical statistical theory: overparameterized networks can perfectly fit their training data, even pure label noise, yet still generalize well, a phenomenon called benign overfitting.
- Key puzzles include double descent (test error peaks near the interpolation threshold and then falls again as model complexity keeps growing), implicit bias (gradient descent prefers low-norm solutions among the many that fit the data), and grokking (networks first memorize, then generalize much later in training); a minimal random-features demo of benign overfitting and double descent appears after this list.
- A proposed theory analyzes networks in output space using the empirical Neural Tangent Kernel (eNTK), decomposing learning into a signal channel (where loss dissipates) and a reservoir (where noise is trapped and invisible to test error); see the decomposition sketch after this list.
- This framework offers a unified account: noise sequestered in the reservoir explains benign overfitting, the movement of noise explains double descent, the spectral order in which directions are learned explains implicit bias, and signal migration explains grokking.
- The theory enables training directly on population risk via a simple algorithm that updates parameters based on the signal vs. noise split, improving efficiency and removing the need for a held-out validation set; a hypothetical sketch of such an update appears after this list.
- Future directions include optimizing training by jumping to solved states, targeting generalization natively, and designing smaller models that mimic the noise-sequestering benefits of overparameterization.
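
A minimal sketch of benign overfitting and double descent, using minimum-norm least squares on random ReLU features rather than a deep network. All sizes, the label-noise level, and the feature counts below are illustrative choices, not values from the article: training error reaches zero once the feature count passes the number of samples, while test error typically peaks near that interpolation threshold and then falls again.

```python
# Random-features regression demo: train error hits zero past the interpolation
# threshold (n_features ~ n_train) while test error peaks there and then falls
# again (double descent / benign overfitting). All sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)

X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)  # noisy labels
y_test = X_test @ w_true                                      # clean targets

def random_feature_fit(n_features):
    """Min-norm least squares on random ReLU features (the interpolator gradient descent finds)."""
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    phi = lambda X: np.maximum(X @ W, 0.0)
    theta = np.linalg.pinv(phi(X_train)) @ y_train  # minimum-norm solution
    train_mse = np.mean((phi(X_train) @ theta - y_train) ** 2)
    test_mse = np.mean((phi(X_test) @ theta - y_test) ** 2)
    return train_mse, test_mse

for p in [10, 50, 90, 100, 110, 200, 500, 2000]:
    tr, te = random_feature_fit(p)
    print(f"features={p:5d}  train MSE={tr:8.4f}  test MSE={te:8.4f}")
```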
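
A sketch of the signal-channel vs. reservoir decomposition in output space. It uses a model that is linear in its parameters, so the empirical NTK is exactly the feature Gram matrix; for a real network the Gram matrix would be built from per-example Jacobians. Treating the top 20 eigendirections as the "signal channel" is an assumption of this sketch, not the article's definition.

```python
# Signal channel vs. reservoir: project the clean signal and the label noise onto the
# eigendirections of the empirical NTK Gram matrix K = J J^T. The clean signal tends to
# concentrate in a few large-eigenvalue directions; the noise spreads over the many
# small-eigenvalue directions. The top-k split below is an illustrative choice.
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 200, 20, 2000                      # samples, input dim, parameters (overparameterized)
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
clean = X @ w_true                           # true signal component of the labels
noise = rng.normal(size=n)                   # label noise
y = clean + noise

W = rng.normal(size=(d, p)) / np.sqrt(d)
J = np.maximum(X @ W, 0.0)                   # Jacobian of f(x) = theta . relu(Wx) w.r.t. theta
K = J @ J.T / p                              # empirical NTK Gram matrix on the training set
eigvals, eigvecs = np.linalg.eigh(K)         # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

def energy_in_top(v, k):
    """Fraction of a vector's squared norm carried by the top-k eNTK eigendirections."""
    coeffs = eigvecs.T @ v
    return np.sum(coeffs[:k] ** 2) / np.sum(coeffs ** 2)

top_k = 20
print(f"clean signal energy in top {top_k} directions: {energy_in_top(clean, top_k):.2f}")
print(f"label noise energy in top {top_k} directions:  {energy_in_top(noise, top_k):.2f}")
```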
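
The article does not spell out the population-risk algorithm here, so the following is a hypothetical sketch of one way a signal-vs.-noise update could look: the residual is filtered through the top eNTK eigendirections before each gradient step, so the reservoir component (mostly label noise) is never fit, and error against clean targets stands in for population risk. Every name and hyperparameter below is an illustration, not the article's method.

```python
# Hypothetical signal/noise-gated update, NOT the article's algorithm: only the
# signal-channel component of the residual drives the parameter update, so the
# reservoir component (mostly label noise) is left untouched.
import numpy as np

rng = np.random.default_rng(2)
n, d, p = 200, 20, 2000
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(size=n)           # noisy training labels

W = rng.normal(size=(d, p)) / np.sqrt(d)
J = np.maximum(X @ W, 0.0)                    # features = Jacobian of the linear-in-theta model
theta = np.zeros(p)

K = J @ J.T                                   # eNTK Gram matrix on the training set
_, U = np.linalg.eigh(K)
P_signal = U[:, -20:] @ U[:, -20:].T          # projector onto top-20 directions (illustrative k)

lr = 1.0 / np.linalg.norm(K, 2)               # step size from the largest eigenvalue
for step in range(500):
    residual = J @ theta - y
    filtered = P_signal @ residual            # keep only the signal-channel component
    theta -= lr * (J.T @ filtered)            # gradient step on the filtered residual
    if step % 100 == 0:
        train_mse = np.mean(residual ** 2)
        clean_mse = np.mean((J @ theta - X @ w_true) ** 2)   # proxy for population risk
        print(f"step {step:4d}  train MSE {train_mse:7.3f}  clean-target MSE {clean_mse:7.3f}")
```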