Hasty Briefs

Pretraining with hierarchical memories separating long-tail and common knowledge

15 hours ago
  • #pretraining
  • #memory-augmentation
  • #language-models
  • Modern language models rely on scaling parameters for performance gains, but this approach is impractical for edge devices with limited memory and compute.
  • A memory-augmented architecture is introduced, featuring small language models that access large hierarchical parametric memory banks for world knowledge.
  • During pretraining and inference, a context-dependent memory block is fetched from the bank and added to the model's feed-forward pathway, so knowledge is stored in the bank and retrieved only when the context calls for it (see the sketch after this list).
  • The pretraining strategy separates long-tail world knowledge (stored in memory parameters) from common knowledge and reasoning abilities (handled by the small language model).
  • Experiments show a 160M-parameter model, augmented with an 18M-parameter memory block fetched from a 4.6B-parameter memory bank, performs comparably to a conventional model with more than twice as many parameters.
  • Hierarchical feed-forward memories are found to work robustly across transformer architectures, whether added during pretraining or post-hoc.
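To make the idea concrete, here is a minimal sketch of a context-dependent memory block augmenting a transformer feed-forward layer. This is not the paper's implementation: the class names (`HierarchicalMemoryBank`, `MemoryAugmentedFFN`), the single-level centroid routing via cosine similarity, and all sizes are illustrative assumptions, standing in for the paper's larger hierarchical bank.

```python
# Illustrative sketch only: a small FFN whose hidden layer is widened by a
# memory block fetched from a large bank of feed-forward "memory" rows.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalMemoryBank(nn.Module):
    """Bank of feed-forward memory rows grouped into clusters.

    Retrieval is hierarchical in spirit: route the context to the nearest
    cluster centroid, then return that cluster's rows as extra FFN weights.
    (Assumed structure; the paper's bank and routing may differ.)
    """

    def __init__(self, d_model: int, rows_per_cluster: int, num_clusters: int):
        super().__init__()
        # Key/value rows, analogous to extra W_in / W_out rows of an FFN.
        self.keys = nn.Parameter(
            torch.randn(num_clusters, rows_per_cluster, d_model) * 0.02)
        self.values = nn.Parameter(
            torch.randn(num_clusters, rows_per_cluster, d_model) * 0.02)
        # One centroid per cluster, used to route the context query.
        self.centroids = nn.Parameter(torch.randn(num_clusters, d_model) * 0.02)

    def fetch(self, context: torch.Tensor):
        # context: (d_model,) summary of the current input.
        scores = F.cosine_similarity(self.centroids, context.unsqueeze(0), dim=-1)
        cluster = int(scores.argmax())
        # Selected memory block: (rows_per_cluster, d_model) each.
        return self.keys[cluster], self.values[cluster]


class MemoryAugmentedFFN(nn.Module):
    """Standard FFN plus a parallel path through the fetched memory rows."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x, mem_keys, mem_values):
        # Base path: common knowledge / reasoning in the small model.
        base = self.w_out(F.gelu(self.w_in(x)))
        # Memory path: fetched rows act as extra hidden units holding
        # long-tail knowledge. x: (batch, seq, d_model).
        mem_hidden = F.gelu(x @ mem_keys.T)      # (batch, seq, rows)
        mem_out = mem_hidden @ mem_values        # (batch, seq, d_model)
        return base + mem_out


if __name__ == "__main__":
    d_model, d_hidden = 256, 1024
    bank = HierarchicalMemoryBank(d_model, rows_per_cluster=64, num_clusters=128)
    ffn = MemoryAugmentedFFN(d_model, d_hidden)

    x = torch.randn(2, 16, d_model)              # (batch, seq, d_model)
    context = x.mean(dim=(0, 1))                 # crude context summary
    keys, values = bank.fetch(context)           # fetch one memory block
    print(ffn(x, keys, values).shape)            # torch.Size([2, 16, 256])
```

Because only the fetched block's parameters are active per input, the small model carries the always-on compute while the bank stores long-tail knowledge that is paged in on demand, which is what makes the approach attractive for edge deployment.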