Hasty Briefs

SUS backprop: linear backpropagation algorithm for long inputs in transformers

a year ago
  • #Transformers
  • #Machine Learning
  • #Backpropagation
  • An unbiased gradient estimator can stochastically cut backpropagation flow through parts of a computational graph while keeping the expected gradient exact.
  • The attention mechanism in transformers limits long sequences because its compute cost grows quadratically with sequence length.
  • Most attention weights become very small, making them natural targets for cutting backpropagation.
  • The proposed probabilistic rule cuts backpropagation through most attention weights, reducing the backward pass from quadratic to linear complexity (see the sketch after this list).
  • Empirical verification shows that cutting 99% of the attention gradient flow increases gradient variance only minimally.
  • An efficient sparse-matrix implementation makes the backward-pass cost negligible relative to the forward pass for long sequences.
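
A minimal PyTorch sketch of the idea, assuming the rule keeps the gradient through each attention weight with probability proportional to its magnitude and rescales survivors by the inverse keep probability; since E[mask / p] = 1, the expected gradient equals the exact gradient, which is what makes the estimator unbiased. The class name, the `keep_scale` knob, and the proportional-to-magnitude choice are illustrative assumptions, not the paper's exact rule, and the mask is dense here for clarity; the linear-complexity claim relies on a sparse kernel that only touches surviving entries.

```python
import torch

class StochasticCutAttention(torch.autograd.Function):
    """Exact forward pass; backward pass stochastically cuts gradient flow
    through attention weights with an unbiased estimator (hedged sketch,
    not the paper's implementation)."""

    @staticmethod
    def forward(ctx, attn, value, keep_scale):
        # attn: (..., q_len, k_len) softmax weights; value: (..., k_len, d).
        # Keep probability grows with the weight's magnitude, capped at 1;
        # keep_scale tunes what fraction of entries survive (hypothetical knob).
        keep_prob = (attn.detach() * keep_scale).clamp(max=1.0)
        mask = torch.bernoulli(keep_prob)
        # Rescaling kept entries by 1/p makes E[gate] = 1, so the gradient
        # estimate stays unbiased even when most entries are cut.
        gate = mask / keep_prob.clamp(min=1e-12)
        ctx.save_for_backward(attn, value, gate)
        return attn @ value

    @staticmethod
    def backward(ctx, grad_out):
        attn, value, gate = ctx.saved_tensors
        # Gated gradient w.r.t. the attention weights: mostly exact zeros,
        # which a sparse kernel could exploit for a linear-cost backward pass.
        grad_attn = (grad_out @ value.transpose(-2, -1)) * gate
        grad_value = attn.transpose(-2, -1) @ grad_out
        return grad_attn, grad_value, None


# Hypothetical usage inside an attention layer:
attn = torch.softmax(torch.randn(2, 128, 128), dim=-1).requires_grad_()
value = torch.randn(2, 128, 64, requires_grad=True)
out = StochasticCutAttention.apply(attn, value, 5.0)
out.sum().backward()  # gradient flows through only the surviving weights
```

Raising `keep_scale` keeps more entries (lower variance, more compute); lowering it cuts more aggressively, in the spirit of the 99% figure reported above.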