Matrix-Valued Residuals
17 hours ago
- #Transformers
- #Machine Learning
- #Neural Networks
- The paper introduces Residual Matrix Transformers (RMT), replacing the traditional residual stream with an outer product memory matrix.
- RMT allows scaling the residual stream size independently of compute and model size, improving performance.
- RMT reaches the same loss as traditional transformers with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens.
- RMT outperforms traditional transformers on downstream evaluations.
- Theoretical analysis shows RMT enables more efficient scaling of the residual stream and better variance propagation properties.
- Code for the project is available at a provided URL.
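
To make the "outer product memory matrix" idea concrete, here is a minimal sketch of a generic outer-product associative memory in plain Python. It is not the paper's implementation: the helper names (`outer`, `write`, `read`), the sizes `D_K` and `D_V`, and the one-hot keys are all assumptions chosen for illustration. The sketch only shows the general mechanism, in which a value is written by adding `key value^T` to a matrix and read back by multiplying with the same key.

```python
def outer(key, value):
    """Outer product key value^T as a nested list: len(key) rows, len(value) columns."""
    return [[k * v for v in value] for k in key]


def write(memory, key, value):
    """Return a new memory with key value^T added to it (an additive, residual-style update)."""
    update = outer(key, value)
    return [[m + u for m, u in zip(m_row, u_row)]
            for m_row, u_row in zip(memory, update)]


def read(memory, key):
    """Return key^T M: the rows of the memory weighted by the key and summed."""
    n_cols = len(memory[0])
    return [sum(key[i] * memory[i][j] for i in range(len(key)))
            for j in range(n_cols)]


# Hypothetical sizes: D_K key slots, each holding a value vector of length D_V.
D_K, D_V = 4, 3
memory = [[0.0] * D_V for _ in range(D_K)]

# One-hot (orthonormal) keys are an assumption that makes retrieval exact.
key_a = [1.0, 0.0, 0.0, 0.0]
key_b = [0.0, 1.0, 0.0, 0.0]
value_a = [0.5, -1.0, 2.0]
value_b = [3.0, 0.25, -0.5]

memory = write(memory, key_a, value_a)
memory = write(memory, key_b, value_b)

recovered_a = read(memory, key_a)
recovered_b = read(memory, key_b)
```

In this toy setup the memory has `D_K * D_V` entries, so its capacity can grow by adding key slots while each stored value vector keeps length `D_V`. That is one way to picture how the residual stream size could be scaled separately from the width of the vectors the layers operate on, though the paper's actual read and write operations may differ.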