A Theory of Generalization in Deep Learning
15 hours ago
- #generalization
- #deep learning theory
- #neural tangent kernel
- Introduces a non-asymptotic theory of generalization in deep learning based on a neural-tangent-kernel partition of the output space into signal and noise directions.
- Shows that minibatch SGD accumulates coherent signal through linear drift while relegating memorization noise to slow diffusion, enabling generalization even in the full feature-learning regime.
- Explains phenomena like benign overfitting, double descent, implicit bias, and grokking through this theoretical framework.
- Derives an exact population-risk objective measurable from a single training run via the noise in the signal channel, and implements it as an SNR preconditioner for Adam (a minimal sketch follows this list).
- Demonstrates practical improvements: accelerates grokking by 5x, suppresses memorization in PINNs and neural representations, and improves DPO fine-tuning with noisy preferences.
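
The bullets describe the SNR preconditioner only at a high level, so below is a minimal, hedged sketch of one way such a preconditioner could be wired into Adam in PyTorch. Everything here is an assumption for illustration, not the paper's implementation: the class name `SNRPreconditionedAdam`, the `snr_eps` parameter, the per-parameter SNR estimate taken from Adam's own running first and second moments, and the `snr / (snr + 1)` rescaling. The paper's actual objective reportedly works in the signal and noise directions defined by the neural tangent kernel rather than with per-parameter statistics.

```python
import torch


class SNRPreconditionedAdam(torch.optim.Adam):
    """Hypothetical sketch: Adam whose gradients are rescaled per parameter by an
    estimated signal-to-noise ratio (coherent drift vs. minibatch diffusion).
    Names and the rescaling rule are illustrative assumptions."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 snr_eps=1e-12, **kwargs):
        super().__init__(params, lr=lr, betas=betas, eps=eps, **kwargs)
        self.snr_eps = snr_eps  # numerical floor for the variance estimate

    @torch.no_grad()
    def step(self, closure=None):
        # Estimate SNR from Adam's running moments (bias correction omitted
        # for brevity): signal ~ squared first moment, noise ~ central variance.
        # Then damp each gradient by snr / (snr + 1) before the usual Adam step,
        # so low-SNR (memorization-dominated) coordinates are suppressed.
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state.get(p)
                if not state or "exp_avg" not in state:
                    continue  # first step: no moment estimates yet
                mean = state["exp_avg"]
                var = (state["exp_avg_sq"] - mean.pow(2)).clamp_(min=0.0)
                snr = mean.pow(2) / (var + self.snr_eps)
                p.grad.mul_(snr / (snr + 1.0))
        return super().step(closure)
```

Usage would be a drop-in replacement for Adam, e.g. `opt = SNRPreconditionedAdam(model.parameters(), lr=1e-3)`; note that because the gradient is rescaled in place before `step`, the damped gradient also feeds Adam's moment updates, which is a design choice of this sketch rather than something claimed by the paper.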