Lost in Backpropagation: The LM Head Is a Gradient Bottleneck
- #neural language models
- #gradient bottleneck
- #softmax bottleneck
- The last layer of neural language models (LMs) projects output features of dimension D to logits over a vocabulary of size V, where D << V.
- This mismatch creates a softmax bottleneck, which is not only an expressivity bottleneck but also an optimization bottleneck.
- Backpropagating V-dimensional gradients through a rank-D linear layer induces unavoidable compression, altering training feedback for most parameters.
- Empirical measurements show 95-99% of the gradient norm is suppressed by the output layer, leading to suboptimal update directions.
- Controlled pretraining experiments reveal the gradient bottleneck makes trivial patterns unlearnable and drastically affects LLM training dynamics.
- This flaw is inherent to the softmax output layer: it contributes to training inefficiencies at scale independently of model architecture, and it calls for new LM head designs.
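The compression claim above can be illustrated with a toy calculation: when a V-dimensional logit gradient is backpropagated through a rank-D head, only its component in the D-dimensional column space of the weight matrix survives, and for random directions that component carries roughly a D/V fraction of the squared norm. The sketch below uses random weights and a random gradient (not the paper's measured LM gradients, and smaller V, D than real vocabularies and hidden sizes) purely to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8192, 256  # illustrative sizes; real LMs have V ~ 32k-256k, D ~ 1k-8k

W = rng.standard_normal((V, D))  # LM head weight: logits = W @ h
g = rng.standard_normal(V)       # gradient of the loss w.r.t. the V logits

# Backprop through the head computes W.T @ g, a D-dimensional vector.
# Equivalently, only the projection of g onto col-space(W) influences h.
Q, _ = np.linalg.qr(W)           # orthonormal basis of the rank-D column space
g_kept = Q @ (Q.T @ g)           # component of g that survives the head
kept = np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2

print(f"fraction of squared gradient norm kept: {kept:.4f} (D/V = {D / V:.4f})")
```

For a random gradient the kept fraction concentrates near D/V, i.e. roughly 97% of the squared gradient norm is discarded at these sizes, which is the same order as the 95-99% suppression reported above. Real LM gradients are not isotropic, so the actual suppression depends on how well they align with the head's column space.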