SUS backprop: linear backpropagation algorithm for long inputs in transformers
a year ago
- #Transformers
- #Machine Learning
- #Backpropagation
- An unbiased gradient estimator can stochastically cut backpropagation flow through parts of a computational graph.
- The attention mechanism in transformers is a limiting factor for long sequences due to its quadratic compute requirements.
- Most attention weights are very small, making them natural targets for cutting backpropagation.
- The proposed probabilistic rule cuts backpropagation through most attention weights, reducing the backward-pass compute from quadratic to linear in sequence length (see the sketch after this list).
- Empirical verification shows that cutting 99% of the attention gradient flow increases gradient variance only minimally.
- An efficient sparse-matrix implementation makes the backward-pass cost negligible relative to the forward pass for long sequences (illustrated in the second sketch below).
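- A minimal sketch of the probabilistic cutting idea in PyTorch (not the paper's reference implementation). The gate is the identity in the forward pass; in the backward pass each attention weight's gradient is kept with probability min(1, a / c) and rescaled by 1/p, which leaves the expected gradient unchanged while zeroing most entries. The class name `StochasticGradientGate`, the cutoff constant `c`, and the exact keep-probability rule are assumptions for illustration.
  ```python
  import torch

  class StochasticGradientGate(torch.autograd.Function):
      """Identity in the forward pass; stochastically sparsifies the gradient in the backward pass."""

      @staticmethod
      def forward(ctx, attn_weights, cutoff):
          # Keep-probability per attention weight: small weights are almost always
          # cut, weights at or above `cutoff` are always kept.
          keep_prob = (attn_weights / cutoff).clamp(max=1.0)
          ctx.save_for_backward(keep_prob)
          return attn_weights

      @staticmethod
      def backward(ctx, grad_output):
          (keep_prob,) = ctx.saved_tensors
          mask = torch.bernoulli(keep_prob)
          # Rescaling surviving gradients by 1/p keeps the estimator unbiased:
          # E[mask / p] = 1 elementwise wherever p > 0.
          grad = grad_output * mask / keep_prob.clamp(min=1e-12)
          return grad, None

  # Hypothetical usage inside a standard attention block:
  #   attn = torch.softmax(scores, dim=-1)
  #   attn = StochasticGradientGate.apply(attn, 0.01)  # cutoff is a tuning knob
  #   out = attn @ values
  ```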
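- A toy illustration of why the sparsity pays off in the backward pass: with roughly 1% of attention-gradient entries surviving, dense n x n products can be replaced by sparse-dense products whose cost scales with the number of kept entries rather than n^2. The shapes and the 1% rate below are placeholders, not measurements from the paper.
  ```python
  import torch

  n, d = 2048, 64
  values = torch.randn(n, d)

  # Simulate an attention-gradient matrix in which ~99% of entries were cut.
  keep = torch.rand(n, n) < 0.01
  grad_attn = torch.randn(n, n) * keep

  # Generic sparse-dense product of the kind that replaces a dense n x n
  # multiplication in the backward pass: cost scales with the number of
  # surviving entries (~0.01 * n^2) times d, instead of n^2 * d.
  sparse_grad = grad_attn.to_sparse()
  grad_contrib = torch.sparse.mm(sparse_grad, values)
  ```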