Pretraining with hierarchical memories: separating long-tail and common knowledge
- #pretraining
- #memory-augmentation
- #language-models
- Modern language models rely on scaling parameters for performance gains, but this approach is impractical for edge devices with limited memory and compute.
- A memory-augmented architecture is introduced, featuring small language models that access large hierarchical parametric memory banks for world knowledge.
- During both pretraining and inference, a context-dependent memory block is fetched from the bank and added to the model, so knowledge is stored once in the bank and only the slice relevant to the current context is retrieved (see the sketch after this list).
- The pretraining strategy separates long-tail world knowledge (stored in memory parameters) from common knowledge and reasoning abilities (handled by the small language model).
- Experiments show a 160M-parameter model augmented with an 18M-parameter memory block fetched from a 4.6B-parameter memory bank performs comparably to a conventional model with more than twice as many parameters.
- Hierarchical feed-forward memories are found to work robustly across transformer architectures, whether added during pretraining or post-hoc.
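- A minimal sketch of the fetch-and-augment idea (not the paper's implementation; the class names, the greedy tree routing, and the choice of splicing memory rows in as extra feed-forward neurons are assumptions made for illustration):

```python
# Minimal sketch, not the paper's implementation: a small transformer feed-forward layer
# augmented with a context-dependent memory block fetched from a larger hierarchical bank.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalMemoryBank(nn.Module):
    """Large pool of feed-forward 'memory' rows organized as a tree (assumed layout).

    A fetch routes the context embedding from the root toward a leaf and concatenates
    the key/value rows stored at each visited node into one small memory block.
    """

    def __init__(self, d_model: int, rows_per_node: int, branching: int, depth: int):
        super().__init__()
        self.branching, self.depth = branching, depth
        # One tiny router per level decides which child to descend into.
        self.routers = nn.ModuleList([nn.Linear(d_model, branching) for _ in range(depth)])
        # All nodes' rows stored flat: level l holds branching**(l+1) nodes.
        num_nodes = sum(branching ** (l + 1) for l in range(depth))
        self.keys = nn.Parameter(0.02 * torch.randn(num_nodes, rows_per_node, d_model))
        self.values = nn.Parameter(0.02 * torch.randn(num_nodes, rows_per_node, d_model))

    def fetch(self, context: torch.Tensor):
        """Greedy top-down routing (an assumption); returns (keys, values) of one block."""
        node, offset, picked_k, picked_v = 0, 0, [], []
        for level, router in enumerate(self.routers):
            child = router(context).argmax(dim=-1).item()
            node = node * self.branching + child          # index within this level
            picked_k.append(self.keys[offset + node])
            picked_v.append(self.values[offset + node])
            offset += self.branching ** (level + 1)       # skip past this level's nodes
        return torch.cat(picked_k, dim=0), torch.cat(picked_v, dim=0)


class MemoryAugmentedFFN(nn.Module):
    """Small model's FFN; fetched memory rows act as extra key/value 'neurons'."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x, mem_keys=None, mem_values=None):
        y = self.w_out(F.relu(self.w_in(x)))              # common-knowledge pathway
        if mem_keys is not None:                          # long-tail pathway from the bank
            y = y + F.relu(x @ mem_keys.transpose(-1, -2)) @ mem_values
        return y


if __name__ == "__main__":
    d_model = 256
    bank = HierarchicalMemoryBank(d_model, rows_per_node=64, branching=8, depth=3)
    ffn = MemoryAugmentedFFN(d_model, d_ff=1024)
    x = torch.randn(1, 16, d_model)                       # (batch, seq, d_model)
    context = x.mean(dim=(0, 1))                          # stand-in for a context embedding
    mem_k, mem_v = bank.fetch(context)
    print(ffn(x, mem_k, mem_v).shape)                     # torch.Size([1, 16, 256])
```

- Keeping the fetched block tiny relative to the bank mirrors the point above: only context-relevant long-tail knowledge needs to be resident on the device at a given time, while common knowledge and reasoning stay in the small model's own weights.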