Matrix Orthogonalization Improves Memory in Recurrent Models
2 days ago
- #recurrent neural networks
- #orthogonalization
- #associative recall
- Transformers excel at associative recall because attention can access every earlier token directly; RNNs must compress history into a fixed-size state, which makes recall hard in applications such as long-horizon RL.
- mLSTM, an LSTM variant with a matrix-valued memory, improves recall on multi-query associative recall (MQAR), but noisy associative recall (NAR) is a better proxy for noisy environments.
- The NAR tasks in MAD draw keys, values, and distractors from distinct token ranges, so a model must recall the key-value mappings despite interference from the distractors.
- The Muon optimizer's orthogonalization equalizes the magnitude of update directions, so dominant directions do not crowd out weaker memories, which aids learning of tail-end associations.
- Orthogonalizing the mLSTM memory matrix during reads (not writes) with Newton-Schulz iterations improves NAR success rates and accuracy, especially on the difficult vocab-96 tasks.
- In the most challenging settings, orthogonalization lifts performance from near failure to reliable levels, but the results are limited to small models and synthetic tasks; whether they transfer to real-world workloads still needs investigation.
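
The read-time orthogonalization summarized above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the post's implementation: it uses the classical cubic Newton-Schulz iteration (Muon uses a tuned quintic variant), a toy memory built as a weighted sum of outer products with orthonormal keys and values, and hypothetical names (`newton_schulz_orthogonalize`, `read_memory`, `strengths`). mLSTM gating and its normalizer state are omitted.

```python
import numpy as np


def newton_schulz_orthogonalize(M, steps=20, eps=1e-7):
    """Approximate the orthogonal polar factor of M (classical cubic Newton-Schulz)."""
    # Dividing by the Frobenius norm makes every singular value at most 1,
    # which lies inside the region where the cubic iteration converges.
    X = M / (np.linalg.norm(M) + eps)
    for _ in range(steps):
        # Each step maps every singular value s to 1.5*s - 0.5*s**3, pushing it toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def read_memory(C, q, orthogonalize=True):
    """Read from matrix memory C with query q. The stored C is never modified."""
    C_read = newton_schulz_orthogonalize(C) if orthogonalize else C
    return C_read @ q


# Toy memory (assumption): orthonormal keys and values written with uneven strengths.
rng = np.random.default_rng(0)
d = 8
K = np.linalg.qr(rng.standard_normal((d, d)))[0]  # columns are keys
V = np.linalg.qr(rng.standard_normal((d, d)))[0]  # columns are values
strengths = np.geomspace(1.0, 0.05, d)            # the last pair is written weakly
C = V @ np.diag(strengths) @ K.T                  # sum_i strengths[i] * v_i k_i^T

weak_key = K[:, -1]
raw_read = read_memory(C, weak_key, orthogonalize=False)
orth_read = read_memory(C, weak_key, orthogonalize=True)
print("raw read norm:           ", np.linalg.norm(raw_read))
print("orthogonalized read norm:", np.linalg.norm(orth_read))
```

In this toy setting the raw read of the weakest key returns its value scaled down by the write strength, while the orthogonalized read returns the same value at roughly unit norm. Because only the read path is orthogonalized, the written memory keeps its original scale information.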