Matrix-Valued Residuals

2 months ago

The paper introduces Residual Matrix Transformers (RMT), replacing the traditional residual stream with an outer product memory matrix.
RMT allows scaling the residual stream size independently of compute and model size, improving performance.
RMT reaches the same loss as a traditional transformer with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens.
RMT outperforms traditional transformers on downstream evaluations.
Theoretical analysis shows RMT enables more efficient scaling of the residual stream and better variance propagation properties.
Code for the project is available at a provided URL.
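
To make the "outer product memory matrix" idea concrete, here is a minimal sketch of a generic outer-product associative memory in plain Python. It is not the paper's layer or its released code: the function names (outer, write, read, zeros), the one-hot keys, and the toy dimensions are all assumptions chosen for illustration. Values are written into a matrix as key-value outer products and read back by multiplying with a key; with orthonormal keys the stored values are recovered exactly.

```python
# Illustrative sketch only: a generic outer-product associative memory,
# NOT the RMT paper's actual layer. All names here are hypothetical.

def outer(key, value):
    """Return the outer product of key and value as a list of rows."""
    return [[k * v for v in value] for k in key]

def write(memory, key, value):
    """Add the outer product of key and value into the memory (new matrix)."""
    update = outer(key, value)
    return [[m + u for m, u in zip(m_row, u_row)]
            for m_row, u_row in zip(memory, update)]

def read(memory, key):
    """Compute key^T * memory, a vector with the value dimension."""
    n_cols = len(memory[0])
    return [sum(k * row[j] for k, row in zip(key, memory))
            for j in range(n_cols)]

def zeros(n_keys, d_value):
    """Create an n_keys x d_value memory matrix filled with zeros."""
    return [[0.0] * d_value for _ in range(n_keys)]

# Two orthonormal (one-hot) keys, each storing a 3-dimensional value.
key_a, key_b = [1.0, 0.0], [0.0, 1.0]
value_a, value_b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]

memory = zeros(n_keys=2, d_value=3)
memory = write(memory, key_a, value_a)
memory = write(memory, key_b, value_b)

# Reading with each key returns the value stored under that key.
recovered_a = read(memory, key_a)
recovered_b = read(memory, key_b)
```

In this toy setup the memory holds n_keys x d_value numbers, while each value read from it keeps dimension d_value. Adding keys enlarges the memory without changing the size of the vectors that downstream computation sees, which is one informal way to picture how a matrix-valued residual could grow independently of the rest of the model.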
