Lost in Backpropagation: The LM Head Is a Gradient Bottleneck
- #neural language models
- #gradient bottleneck
- #softmax bottleneck
- The last layer of neural language models (LMs) projects output features of dimension D to logits over a vocabulary of size V, where D << V.
- This mismatch creates a softmax bottleneck, which is not only an expressivity bottleneck but also an optimization bottleneck.
- Backpropagating V-dimensional gradients through a rank-D linear layer induces unavoidable compression, altering training feedback for most parameters.
- Empirical measurements show 95-99% of the gradient norm is suppressed by the output layer, leading to suboptimal update directions.
- Controlled pretraining experiments reveal the gradient bottleneck makes trivial patterns unlearnable and drastically affects LLM training dynamics.
- This flaw is inherent to the softmax output layer: it contributes to training inefficiencies at scale independently of model architecture, and it calls for new LM head designs.
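The compression claim above can be illustrated with a toy calculation: when a V-dimensional logit gradient is backpropagated through a rank-D head, only its component in the D-dimensional column space of the weight matrix survives, and for random directions that component carries roughly a D/V fraction of the squared norm. The sketch below uses random weights and a random gradient (not the paper's measured LM gradients, and smaller V, D than real vocabularies and hidden sizes) purely to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8192, 256  # illustrative sizes; real LMs have V ~ 32k-256k, D ~ 1k-8k

W = rng.standard_normal((V, D))  # LM head weight: logits = W @ h
g = rng.standard_normal(V)       # gradient of the loss w.r.t. the V logits

# Backprop through the head computes W.T @ g, a D-dimensional vector.
# Equivalently, only the projection of g onto col-space(W) influences h.
Q, _ = np.linalg.qr(W)           # orthonormal basis of the rank-D column space
g_kept = Q @ (Q.T @ g)           # component of g that survives the head
kept = np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2

print(f"fraction of squared gradient norm kept: {kept:.4f} (D/V = {D / V:.4f})")
```

For a random gradient the kept fraction concentrates near D/V, i.e. roughly 97% of the squared gradient norm is discarded at these sizes, which is the same order as the 95-99% suppression reported above. Real LM gradients are not isotropic, so the actual suppression depends on how well they align with the head's column space.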