Hasty Briefs (beta)


A Theory of Deep Learning

a day ago
  • #Deep Learning Theory
  • #Generalization
  • #Statistical Learning
  • Deep learning challenges classical statistical theory: overparameterized networks can perfectly fit training data (even pure noise) yet still generalize well, a phenomenon called benign overfitting.
  • Key puzzles include double descent (test error rises and then falls again as model complexity grows past the interpolation threshold), implicit bias (gradient descent prefers low-norm solutions), and grokking (networks memorize first, then generalize later).
  • A proposed theory analyzes networks in output space using the empirical Neural Tangent Kernel (eNTK), decomposing learning into a signal channel (where loss dissipates) and a reservoir (where noise is trapped and test-invisible).
  • This framework unifies explanations: noise in the reservoir explains benign overfitting, noise movement explains double descent, spectral learning order explains implicit bias, and signal migration explains grokking.
  • The theory enables training directly on population risk with a simple algorithm that updates parameters based on the signal component rather than the noise, improving efficiency and eliminating the need for a held-out validation set.
  • Future directions include optimizing training by jumping to solved states, targeting generalization natively, and designing smaller models that mimic the noise-sequestering benefits of overparameterization.
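The signal/reservoir decomposition summarized above can be illustrated with a minimal numerical sketch. This is not the paper's implementation: the network, the finite-difference Jacobian, and the cutoff `k` for the "signal channel" are all illustrative assumptions. It computes the empirical NTK of a tiny two-layer network, eigendecomposes it, and splits the training residual into energy in the top eigendirections (signal) versus the long tail of small eigenvalues (reservoir).

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer tanh network; theta packs both weight layers.
# Sizes are arbitrary small values chosen for illustration.
n, d, h = 20, 5, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)  # pure-noise targets, mimicking the benign-overfitting setup

def unpack(theta):
    W1 = theta[:d * h].reshape(d, h)
    w2 = theta[d * h:]
    return W1, w2

def f(theta, X):
    W1, w2 = unpack(theta)
    return np.tanh(X @ W1) @ w2

theta = rng.normal(size=d * h + h) / np.sqrt(h)

def jacobian(theta, X, eps=1e-5):
    # Finite-difference Jacobian of the n outputs w.r.t. the p parameters.
    p = theta.size
    J = np.zeros((n, p))
    f0 = f(theta, X)
    for i in range(p):
        t = theta.copy()
        t[i] += eps
        J[:, i] = (f(t, X) - f0) / eps
    return J

# Empirical NTK in output space: K = J J^T (n x n).
J = jacobian(theta, X)
K = J @ J.T

# Eigendecomposition: a few dominant directions carry most of the kernel's
# mass (the "signal channel"); the long tail of tiny eigenvalues is the
# "reservoir" where fitted noise barely moves the function off the data.
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort descending

residual = y - f(theta, X)
coeffs = evecs.T @ residual          # residual in the eigenbasis
k = 5                                # hypothetical signal-channel cutoff
signal_energy = np.sum(coeffs[:k] ** 2)
reservoir_energy = np.sum(coeffs[k:] ** 2)

print(f"top-{k} eigenvalue share of the kernel: {evals[:k].sum() / evals.sum():.2f}")
print(f"residual energy in signal / reservoir: {signal_energy:.3f} / {reservoir_energy:.3f}")
```

Under this picture, the proposed algorithm would update parameters using only the residual's signal-channel component, leaving the reservoir (which is claimed to be test-invisible) untouched; since the eigenvectors are orthonormal, the two energies always sum to the full squared residual.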