The Continual Learning Problem
- #memory-layers
- #machine-learning
- #continual-learning
- Memory layers are proposed as a solution for continual learning, enabling models to update parameters without catastrophic forgetting.
- Memory layers allow for high-capacity, sparse updates, with only a small subset of parameters active during each forward pass.
- Experiments show memory layers substantially reduce forgetting: only an 11% performance drop on NaturalQuestions, versus 89% for full finetuning and 71% for LoRA.
- Continual learning is framed as two subproblems: generalization (learning important bits from data) and integration (avoiding forgetting while updating knowledge).
- Memory layers use sparse attention over a pool of learned keys and values, so each token reads from, and backpropagates into, only a handful of slots, enabling targeted updates and efficient information storage (see the first sketch after this list).
- Sparse memory finetuning uses a TF-IDF-like ranking to pick which memory slots to update, favoring slots accessed often on the new data but rarely elsewhere, which minimizes interference with existing knowledge (see the second sketch after this list).
- Memory architectures offer potential advantages for larger models, with opportunities for better interpretability and organization of knowledge.
- Optimizer choice (e.g., SGD vs. AdamW) affects continual learning performance, with SGD showing better results for sparse memory finetuning (the toy update step after this list uses plain SGD for that reason).
- Future directions include scaling memory layers to larger models, improving benchmarks, and exploring hybrid approaches between memory layers and in-context learning.
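
A minimal sketch of the lookup described above, in PyTorch. A query projection scores a pool of learned keys, only the top-k slots contribute to the output, and therefore only those slots receive gradient. The class name, sizes, and residual connection are assumptions for illustration; production memory layers typically use product-key lookup so they never have to score every slot, which this toy omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMemoryLayer(nn.Module):
    """Toy memory layer: sparse attention over a large pool of learned key/value slots."""

    def __init__(self, d_model: int, n_slots: int = 16384, top_k: int = 32):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)    # learned keys
        self.values = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)  # learned values
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self.query_proj(x)                                # (B, T, D)
        scores = q @ self.keys.T                              # (B, T, n_slots)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)               # attend only to the top-k slots
        selected = self.values[top_idx]                       # (B, T, k, D)
        out = (weights.unsqueeze(-1) * selected).sum(dim=-2)
        # Gradients flow only into the selected rows of `keys` and `values`,
        # so both the forward pass and the update are sparse.
        return x + out
```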
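A hedged sketch of the TF-IDF-style slot selection. The assumption is that we can record how often each memory slot is accessed, both on the new data (term frequency) and on a background corpus (document frequency); slots that fire often on the new data but rarely in the background are allowed to update, and every other slot's gradient is zeroed. Smoothing constants and the number of slots to update are placeholders, not the authors' exact recipe.

```python
import torch

def rank_slots_tfidf(new_counts: torch.Tensor,
                     background_counts: torch.Tensor,
                     n_update: int = 500) -> torch.Tensor:
    """new_counts[i]: accesses of slot i on the new data (term frequency).
    background_counts[i]: accesses of slot i on a background corpus (document frequency)."""
    tf = new_counts.float()
    idf = torch.log((background_counts.sum() + 1.0) / (background_counts.float() + 1.0))
    scores = tf * idf
    return scores.topk(n_update).indices           # slot ids allowed to receive updates

def mask_memory_gradients(memory_layer, allowed_idx: torch.Tensor) -> None:
    """Zero the gradient of every memory slot except the selected ones."""
    keep = torch.zeros(memory_layer.values.shape[0], dtype=torch.bool,
                       device=memory_layer.values.device)
    keep[allowed_idx] = True
    for param in (memory_layer.keys, memory_layer.values):
        if param.grad is not None:
            param.grad[~keep] = 0.0
```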
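Tying the two sketches together, and following the optimizer note above by using plain SGD on the memory parameters only. Everything here (access counts, the loss, learning rate, sizes) is a placeholder to show the control flow, not a reproduction of the experiments.

```python
layer = SparseMemoryLayer(d_model=64, n_slots=1024, top_k=8)
optimizer = torch.optim.SGD([layer.keys, layer.values], lr=0.5)  # SGD per the note above

x = torch.randn(2, 16, 64)                             # stand-in batch of hidden states
new_counts = torch.randint(0, 50, (1024,))             # pretend slot accesses on new data
background_counts = torch.randint(0, 500, (1024,))     # pretend background accesses

allowed = rank_slots_tfidf(new_counts, background_counts, n_update=64)
loss = layer(x).pow(2).mean()                          # placeholder loss
loss.backward()
mask_memory_gradients(layer, allowed)                  # restrict the update to selected slots
optimizer.step()
optimizer.zero_grad()
```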