Matrix Orthogonalization Improves Memory in Recurrent Models

2 days ago

Transformers excel at associative recall because attention gives them direct access to every past token. RNNs compress history into a fixed-size state, so recall is a challenge for them in applications like long-horizon RL.
mLSTM, an LSTM variant with a matrix-valued memory, shows improved recall on multi-query associative recall (MQAR), but noisy associative recall (NAR) is a better proxy for noisy environments.
The noisy AR tasks in MAD draw keys, values, and distractors from distinct token ranges, so a model must recall key-to-value mappings despite interference from the distractors.
The Muon optimizer's orthogonalization equalizes the scale of update directions, which prevents dominant directions from crowding out weaker memories and aids tail-end associative learning.
Orthogonalizing the mLSTM memory matrix during reads (not writes) with Newton-Schulz iterations improves NAR success rates and accuracy, especially on the difficult vocab-96 tasks.
In challenging settings, orthogonalization lifts performance from near-failure to reliable levels, but the results are limited to small models and synthetic tasks; whether the gains transfer to real-world tasks needs investigation.
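
To make the read-time idea concrete, here is a minimal toy sketch in NumPy. It is not the article's implementation: it uses the classic cubic Newton-Schulz iteration (the coefficients and step count used in the article or in Muon may differ), it omits mLSTM's gates and normalizer, and the names `newton_schulz_orthogonalize`, `read`, and `strengths` are hypothetical. The memory is written as a plain sum of outer products with deliberately unequal strengths, and orthogonalization is applied only when reading.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=20):
    """Push the nonzero singular values of M toward 1 (approximate polar factor).

    Classic cubic Newton-Schulz iteration; an assumption for illustration,
    not necessarily the variant used in the article.
    """
    # The Frobenius norm upper-bounds the spectral norm, so after scaling
    # every singular value is at most 1 and the iteration converges.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
d, n_pairs = 16, 4

# Orthonormal keys and values (columns), a simplification that keeps the toy clean.
keys = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :n_pairs]
values = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :n_pairs]

# Hypothetical write strengths: pair 0 dominates, pair 3 is a weak "tail" memory.
strengths = np.array([1.0, 0.5, 0.1, 0.02])

# Write path is left untouched: C = sum_i s_i * v_i k_i^T.
C = (values * strengths) @ keys.T

def read(memory, query):
    """Return the index of the stored value closest to memory @ query."""
    out = memory @ query
    return int(np.argmax(values.T @ out))

# Noisy query: mostly the weak key, contaminated by the dominant key.
query = keys[:, 3] + 0.3 * keys[:, 0]

raw_hit = read(C, query)                                  # read from the raw memory
ortho_hit = read(newton_schulz_orthogonalize(C), query)   # orthogonalize at read time only
print(raw_hit, ortho_hit)
```

In this toy setting the raw read is dominated by the strong pair, while the orthogonalized read recovers the weak pair, because every stored direction ends up with roughly unit singular value. It illustrates the mechanism only and says nothing about how large the effect is in a trained mLSTM.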
