Hasty Briefs

Matrix Valued Residuals

18 hours ago
  • #Transformers
  • #Machine Learning
  • #Neural Networks
  • The paper introduces the Residual Matrix Transformer (RMT), which replaces the traditional residual stream with an outer-product memory matrix.
  • RMT allows scaling the residual stream size independently of compute and model size, improving performance.
  • RMT achieves the same loss as traditional transformers with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens.
  • RMT outperforms traditional transformers on downstream evaluations.
  • Theoretical analysis shows RMT enables more efficient scaling of the residual stream and better variance propagation properties.
  • Code for the project is available at a provided URL.
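
To make the "outer-product memory matrix" idea concrete, here is a minimal sketch of a generic outer-product associative memory in plain Python. It is not the paper's code or architecture: the function names (`outer`, `write`, `read`, `empty_memory`), the sizes, and the one-hot keys are all assumptions chosen for illustration. The sketch only shows the general mechanism of storing vectors by adding key ⊗ value into a matrix and reading them back by multiplying with the key.

```python
# Generic outer-product associative memory (illustrative only, not the RMT code).
# All names and sizes here are hypothetical.

def outer(key, value):
    """Return the outer product key ⊗ value as a list of rows."""
    return [[k * v for v in value] for k in key]

def empty_memory(key_dim, value_dim):
    """Create a key_dim x value_dim matrix of zeros."""
    return [[0.0] * value_dim for _ in range(key_dim)]

def write(memory, key, value):
    """Add key ⊗ value into the memory matrix and return the new matrix."""
    update = outer(key, value)
    return [[m + u for m, u in zip(m_row, u_row)]
            for m_row, u_row in zip(memory, update)]

def read(memory, key):
    """Retrieve key^T @ memory: one entry per memory column."""
    n_cols = len(memory[0])
    return [sum(k * row[j] for k, row in zip(key, memory))
            for j in range(n_cols)]

# Two orthonormal (one-hot) keys, so the stored values do not interfere.
key_a, key_b = [1.0, 0.0], [0.0, 1.0]
value_a, value_b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]

memory = empty_memory(key_dim=2, value_dim=3)
memory = write(memory, key_a, value_a)
memory = write(memory, key_b, value_b)

recovered_a = read(memory, key_a)
recovered_b = read(memory, key_b)
```

With orthonormal keys, each read returns exactly the value written under that key. In this generic scheme the memory's capacity is set by the matrix dimensions rather than by the width of any single vector being written, which gives some intuition for how a matrix-valued residual could be sized separately from the rest of the model; how RMT actually does this is specified in the paper, not here.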