A Theory of Deep Learning
- #Deep Learning Theory
- #Generalization
- #Statistical Learning
- Deep learning challenges classical statistical theory: overparameterized networks can perfectly fit their training data, even pure label noise, yet still generalize well, a phenomenon called benign overfitting.
- Key puzzles include double descent (test error peaks near the interpolation threshold and then falls again as model complexity keeps growing), implicit bias (gradient descent prefers low-norm solutions among the many that fit the data), and grokking (networks first memorize, then generalize much later in training); a minimal random-features demo of benign overfitting and double descent appears after this list.
- A proposed theory analyzes networks in output space using the empirical Neural Tangent Kernel (eNTK), decomposing learning into a signal channel (where loss dissipates) and a reservoir (where noise is trapped and invisible to test error); see the decomposition sketch after this list.
- This framework offers a unified account: noise sequestered in the reservoir explains benign overfitting, the movement of noise explains double descent, the spectral order in which directions are learned explains implicit bias, and signal migration explains grokking.
- The theory enables training directly on population risk via a simple algorithm that updates parameters based on the signal vs. noise split, improving efficiency and removing the need for a held-out validation set; a hypothetical sketch of such an update appears after this list.
- Future directions include optimizing training by jumping to solved states, targeting generalization natively, and designing smaller models that mimic the noise-sequestering benefits of overparameterization.
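
A minimal sketch of benign overfitting and double descent, using minimum-norm least squares on random ReLU features rather than a deep network. All sizes, the label-noise level, and the feature counts below are illustrative choices, not values from the article: training error reaches zero once the feature count passes the number of samples, while test error typically peaks near that interpolation threshold and then falls again.

```python
# Random-features regression demo: train error hits zero past the interpolation
# threshold (n_features ~ n_train) while test error peaks there and then falls
# again (double descent / benign overfitting). All sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)

X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)  # noisy labels
y_test = X_test @ w_true                                      # clean targets

def random_feature_fit(n_features):
    """Min-norm least squares on random ReLU features (the interpolator gradient descent finds)."""
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    phi = lambda X: np.maximum(X @ W, 0.0)
    theta = np.linalg.pinv(phi(X_train)) @ y_train  # minimum-norm solution
    train_mse = np.mean((phi(X_train) @ theta - y_train) ** 2)
    test_mse = np.mean((phi(X_test) @ theta - y_test) ** 2)
    return train_mse, test_mse

for p in [10, 50, 90, 100, 110, 200, 500, 2000]:
    tr, te = random_feature_fit(p)
    print(f"features={p:5d}  train MSE={tr:8.4f}  test MSE={te:8.4f}")
```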
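
A sketch of the signal-channel vs. reservoir decomposition in output space. It uses a model that is linear in its parameters, so the empirical NTK is exactly the feature Gram matrix; for a real network the Gram matrix would be built from per-example Jacobians. Treating the top 20 eigendirections as the "signal channel" is an assumption of this sketch, not the article's definition.

```python
# Signal channel vs. reservoir: project the clean signal and the label noise onto the
# eigendirections of the empirical NTK Gram matrix K = J J^T. The clean signal tends to
# concentrate in a few large-eigenvalue directions; the noise spreads over the many
# small-eigenvalue directions. The top-k split below is an illustrative choice.
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 200, 20, 2000                      # samples, input dim, parameters (overparameterized)
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
clean = X @ w_true                           # true signal component of the labels
noise = rng.normal(size=n)                   # label noise
y = clean + noise

W = rng.normal(size=(d, p)) / np.sqrt(d)
J = np.maximum(X @ W, 0.0)                   # Jacobian of f(x) = theta . relu(Wx) w.r.t. theta
K = J @ J.T / p                              # empirical NTK Gram matrix on the training set
eigvals, eigvecs = np.linalg.eigh(K)         # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

def energy_in_top(v, k):
    """Fraction of a vector's squared norm carried by the top-k eNTK eigendirections."""
    coeffs = eigvecs.T @ v
    return np.sum(coeffs[:k] ** 2) / np.sum(coeffs ** 2)

top_k = 20
print(f"clean signal energy in top {top_k} directions: {energy_in_top(clean, top_k):.2f}")
print(f"label noise energy in top {top_k} directions:  {energy_in_top(noise, top_k):.2f}")
```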
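
The article does not spell out the population-risk algorithm here, so the following is a hypothetical sketch of one way a signal-vs.-noise update could look: the residual is filtered through the top eNTK eigendirections before each gradient step, so the reservoir component (mostly label noise) is never fit, and error against clean targets stands in for population risk. Every name and hyperparameter below is an illustration, not the article's method.

```python
# Hypothetical signal/noise-gated update, NOT the article's algorithm: only the
# signal-channel component of the residual drives the parameter update, so the
# reservoir component (mostly label noise) is left untouched.
import numpy as np

rng = np.random.default_rng(2)
n, d, p = 200, 20, 2000
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(size=n)           # noisy training labels

W = rng.normal(size=(d, p)) / np.sqrt(d)
J = np.maximum(X @ W, 0.0)                    # features = Jacobian of the linear-in-theta model
theta = np.zeros(p)

K = J @ J.T                                   # eNTK Gram matrix on the training set
_, U = np.linalg.eigh(K)
P_signal = U[:, -20:] @ U[:, -20:].T          # projector onto top-20 directions (illustrative k)

lr = 1.0 / np.linalg.norm(K, 2)               # step size from the largest eigenvalue
for step in range(500):
    residual = J @ theta - y
    filtered = P_signal @ residual            # keep only the signal-channel component
    theta -= lr * (J.T @ filtered)            # gradient step on the filtered residual
    if step % 100 == 0:
        train_mse = np.mean(residual ** 2)
        clean_mse = np.mean((J @ theta - X @ w_true) ** 2)   # proxy for population risk
        print(f"step {step:4d}  train MSE {train_mse:7.3f}  clean-target MSE {clean_mse:7.3f}")
```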