SUS backprop: linear backpropagation algorithm for long inputs in transformers
a year ago
- #Transformers
- #Machine Learning
- #Backpropagation
- An unbiased gradient estimator can stochastically cut backpropagation flow through parts of a computational graph.
- The attention mechanism in transformers is a limiting factor for long sequences due to its quadratic compute requirements.
- Most attention weights are very small, making them natural targets for cutting backpropagation.
- The proposed probabilistic rule cuts backpropagation through most attention weights, reducing the backward-pass compute from quadratic to linear in sequence length (see the sketch after this list).
- Empirical verification shows that cutting 99% of the attention gradient flow increases gradient variance only minimally.
- An efficient sparse-matrix implementation makes the backward-pass cost negligible relative to the forward pass for long sequences (illustrated in the second sketch below).
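- A minimal sketch of the probabilistic cutting idea in PyTorch (not the paper's reference implementation). The gate is the identity in the forward pass; in the backward pass each attention weight's gradient is kept with probability min(1, a / c) and rescaled by 1/p, which leaves the expected gradient unchanged while zeroing most entries. The class name `StochasticGradientGate`, the cutoff constant `c`, and the exact keep-probability rule are assumptions for illustration.
  ```python
  import torch

  class StochasticGradientGate(torch.autograd.Function):
      """Identity in the forward pass; stochastically sparsifies the gradient in the backward pass."""

      @staticmethod
      def forward(ctx, attn_weights, cutoff):
          # Keep-probability per attention weight: small weights are almost always
          # cut, weights at or above `cutoff` are always kept.
          keep_prob = (attn_weights / cutoff).clamp(max=1.0)
          ctx.save_for_backward(keep_prob)
          return attn_weights

      @staticmethod
      def backward(ctx, grad_output):
          (keep_prob,) = ctx.saved_tensors
          mask = torch.bernoulli(keep_prob)
          # Rescaling surviving gradients by 1/p keeps the estimator unbiased:
          # E[mask / p] = 1 elementwise wherever p > 0.
          grad = grad_output * mask / keep_prob.clamp(min=1e-12)
          return grad, None

  # Hypothetical usage inside a standard attention block:
  #   attn = torch.softmax(scores, dim=-1)
  #   attn = StochasticGradientGate.apply(attn, 0.01)  # cutoff is a tuning knob
  #   out = attn @ values
  ```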
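- A toy illustration of why the sparsity pays off in the backward pass: with roughly 1% of attention-gradient entries surviving, dense n x n products can be replaced by sparse-dense products whose cost scales with the number of kept entries rather than n^2. The shapes and the 1% rate below are placeholders, not measurements from the paper.
  ```python
  import torch

  n, d = 2048, 64
  values = torch.randn(n, d)

  # Simulate an attention-gradient matrix in which ~99% of entries were cut.
  keep = torch.rand(n, n) < 0.01
  grad_attn = torch.randn(n, n) * keep

  # Generic sparse-dense product of the kind that replaces a dense n x n
  # multiplication in the backward pass: cost scales with the number of
  # surviving entries (~0.01 * n^2) times d, instead of n^2 * d.
  sparse_grad = grad_attn.to_sparse()
  grad_contrib = torch.sparse.mm(sparse_grad, values)
  ```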