Hasty Briefs

Lost in Backpropagation: The LM Head Is a Gradient Bottleneck

2 days ago
  • #neural language models
  • #gradient bottleneck
  • #softmax bottleneck
  • The last layer of neural language models (LMs) projects output features of dimension D to logits in dimension V (vocabulary size), where D << V.
  • This mismatch creates a softmax bottleneck, which is not only an expressivity bottleneck but also an optimization bottleneck.
  • Backpropagating V-dimensional gradients through a rank-D linear layer induces unavoidable compression, altering training feedback for most parameters.
  • Empirical measurements show 95-99% of the gradient norm is suppressed by the output layer, leading to suboptimal update directions.
  • Controlled pretraining experiments reveal the gradient bottleneck makes trivial patterns unlearnable and drastically affects LLM training dynamics.
  • This inherent flaw contributes to training inefficiencies at scale, independent of model architecture, and motivates new LM head designs.
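The compression described above can be sketched numerically. For a rank-D head, backpropagation implicitly projects the V-dimensional logit gradient onto the D-dimensional column space of the head's weight matrix; for a generic gradient, only about sqrt(D/V) of its norm survives. The sketch below uses scaled-down illustrative sizes (not the paper's actual configuration) and a random Gaussian head and gradient, so the exact suppression figures differ from the 95-99% measured on real LMs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down illustrative sizes: hidden dimension D << vocabulary size V.
D, V = 64, 8192

# LM head mapping features (D) to logits (V); random init for illustration.
W = rng.standard_normal((V, D)) / np.sqrt(D)

# A generic gradient of the loss with respect to the V logits.
g = rng.standard_normal(V)

# Backprop through the head computes W.T @ g, so only the component of g
# lying in the D-dimensional column space of W influences earlier layers.
Q, _ = np.linalg.qr(W)          # orthonormal basis (V x D) of that subspace
g_kept = Q @ (Q.T @ g)          # component of g that survives the projection

survive = np.linalg.norm(g_kept) / np.linalg.norm(g)
print(f"gradient norm surviving:  {survive:.3f}")
print(f"gradient norm suppressed: {1 - survive:.3f}")
print(f"sqrt(D/V) prediction:     {np.sqrt(D / V):.3f}")
```

For a random gradient the surviving fraction concentrates tightly around sqrt(D/V) (about 0.09 at these sizes, i.e. roughly 91% of the norm suppressed); in trained LMs the paper's bullet above reports an even larger 95-99% suppression, since real logit gradients are not adversarially aligned with the head's row space.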