Pretraining with hierarchical memories: separating long-tail and common knowledge
- #pretraining
- #memory-augmentation
- #language-models
- Modern language models rely on scaling parameters for performance gains, but this approach is impractical for edge devices with limited memory and compute.
- A memory-augmented architecture is introduced, featuring small language models that access large hierarchical parametric memory banks for world knowledge.
- During both pretraining and inference, a context-dependent memory block is fetched from the bank and added to the model, so knowledge is stored once in the bank and only the slice relevant to the current context is retrieved (see the sketch after this list).
- The pretraining strategy separates long-tail world knowledge (stored in memory parameters) from common knowledge and reasoning abilities (handled by the small language model).
- Experiments show a 160M-parameter model augmented with an 18M-parameter memory block fetched from a 4.6B-parameter memory bank performs comparably to a conventional model with more than twice as many parameters.
- Hierarchical feed-forward memories are found to work robustly across transformer architectures, whether added during pretraining or post-hoc.
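- A minimal sketch of the fetch-and-augment idea (not the paper's implementation; the class names, the greedy tree routing, and the choice of splicing memory rows in as extra feed-forward neurons are assumptions made for illustration):

```python
# Minimal sketch, not the paper's implementation: a small transformer feed-forward layer
# augmented with a context-dependent memory block fetched from a larger hierarchical bank.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalMemoryBank(nn.Module):
    """Large pool of feed-forward 'memory' rows organized as a tree (assumed layout).

    A fetch routes the context embedding from the root toward a leaf and concatenates
    the key/value rows stored at each visited node into one small memory block.
    """

    def __init__(self, d_model: int, rows_per_node: int, branching: int, depth: int):
        super().__init__()
        self.branching, self.depth = branching, depth
        # One tiny router per level decides which child to descend into.
        self.routers = nn.ModuleList([nn.Linear(d_model, branching) for _ in range(depth)])
        # All nodes' rows stored flat: level l holds branching**(l+1) nodes.
        num_nodes = sum(branching ** (l + 1) for l in range(depth))
        self.keys = nn.Parameter(0.02 * torch.randn(num_nodes, rows_per_node, d_model))
        self.values = nn.Parameter(0.02 * torch.randn(num_nodes, rows_per_node, d_model))

    def fetch(self, context: torch.Tensor):
        """Greedy top-down routing (an assumption); returns (keys, values) of one block."""
        node, offset, picked_k, picked_v = 0, 0, [], []
        for level, router in enumerate(self.routers):
            child = router(context).argmax(dim=-1).item()
            node = node * self.branching + child          # index within this level
            picked_k.append(self.keys[offset + node])
            picked_v.append(self.values[offset + node])
            offset += self.branching ** (level + 1)       # skip past this level's nodes
        return torch.cat(picked_k, dim=0), torch.cat(picked_v, dim=0)


class MemoryAugmentedFFN(nn.Module):
    """Small model's FFN; fetched memory rows act as extra key/value 'neurons'."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x, mem_keys=None, mem_values=None):
        y = self.w_out(F.relu(self.w_in(x)))              # common-knowledge pathway
        if mem_keys is not None:                          # long-tail pathway from the bank
            y = y + F.relu(x @ mem_keys.transpose(-1, -2)) @ mem_values
        return y


if __name__ == "__main__":
    d_model = 256
    bank = HierarchicalMemoryBank(d_model, rows_per_node=64, branching=8, depth=3)
    ffn = MemoryAugmentedFFN(d_model, d_ff=1024)
    x = torch.randn(1, 16, d_model)                       # (batch, seq, d_model)
    context = x.mean(dim=(0, 1))                          # stand-in for a context embedding
    mem_k, mem_v = bank.fetch(context)
    print(ffn(x, mem_k, mem_v).shape)                     # torch.Size([1, 16, 256])
```

- Keeping the fetched block tiny relative to the bank mirrors the point above: only context-relevant long-tail knowledge needs to be resident on the device at a given time, while common knowledge and reasoning stay in the small model's own weights.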