Matrix Orthogonalization Improves Memory in Recurrent Models

2 days ago
  • #recurrent neural networks
  • #orthogonalization
  • #associative recall
  • Transformers excel at associative recall because attention gives them direct access to earlier tokens; RNNs must compress the past into a fixed-size state, which makes recall a challenge in applications like long-horizon RL.
  • mLSTM, the xLSTM variant with a matrix memory, shows improved recall on multi-query associative recall (MQAR), but noisy associative recall (NAR) is a better proxy for noisy environments.
  • NAR tasks in MAD use distinct token ranges for keys, values, and distractors, so a model must recall the key-value mappings despite interference from distractor tokens.
  • The Muon optimizer's orthogonalization step equalizes the magnitude of update directions, so dominant directions cannot crowd out weaker memories; this helps the model learn tail-end associations.
  • Orthogonalizing the mLSTM memory matrix during reads (not writes) with Newton-Schulz iterations improves NAR success rates and accuracy, especially on the difficult vocab-96 tasks.
  • Orthogonalization lifts performance from near-failure to reliable levels in the hardest settings, but the results are limited to small models and synthetic tasks; whether they carry over to real-world workloads still needs investigation.
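
The read-time orthogonalization described in the points above can be sketched on a toy matrix memory. This is an illustrative sketch only, not the article's implementation: it uses the classic cubic Newton-Schulz iteration (Muon and the article may use a differently tuned polynomial), replaces the mLSTM gates with fixed write strengths, and omits the mLSTM normalizer. The names `newton_schulz_orthogonalize`, `write_memory`, and `recall` are hypothetical helpers introduced here.

```python
import numpy as np


def newton_schulz_orthogonalize(M, steps=30):
    """Push every nonzero singular value of M toward 1, keeping its singular vectors.

    Classic cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 X X^T X.
    Assumption: the article's exact polynomial and step count are not given.
    """
    # Dividing by the Frobenius norm keeps all singular values <= 1,
    # which is inside the convergence region of the iteration.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def write_memory(keys, values, strengths):
    """Matrix memory as a weighted sum of outer products value_i key_i^T.

    Assumption: fixed strengths stand in for the learned mLSTM gates.
    """
    return sum(s * np.outer(v, k) for k, v, s in zip(keys, values, strengths))


def recall(C, query, values):
    """Read C @ query and return the index of the best-matching stored value."""
    return int(np.argmax(values @ (C @ query)))


# Three key-value pairs stored with very unequal strengths.
keys = np.eye(3)
values = np.roll(np.eye(3), 1, axis=1)
C = write_memory(keys, values, strengths=[5.0, 1.0, 0.2])

# Noisy query for the weakly stored pair (index 2), with a little
# leakage toward the strongly stored key (index 0).
query = keys[2] + 0.1 * keys[0]

# Writes are untouched; the matrix is orthogonalized only for the read.
C_orth = newton_schulz_orthogonalize(C)

raw_guess = recall(C, query, values)        # 0: the strong memory wins
orth_guess = recall(C_orth, query, values)  # 2: the weak memory is recovered
print(raw_guess, orth_guess)
```

In this toy case the raw read is dominated by the pair written with strength 5.0, even though the query mostly matches the pair written with strength 0.2. After orthogonalization all stored directions have equal gain, so the small leakage no longer overrides the intended key. Practical implementations typically run far fewer iterations, which equalizes the singular values only partially.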