NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute
- #Data Efficiency
- #NanoGPT
- #Ensemble Learning
- NanoGPT Slowrun reports a roughly 10x gain in data efficiency, exceeding what current scaling laws predict for the same token budget.
- Ensembling is understudied in pretraining: training multiple models independently and aggregating their predictions improves generalization (a minimal aggregation sketch follows this list).
- Training dynamics differ for ensembles: pushing each member past its individually optimal stopping point can still improve the aggregated predictions.
- Chain distillation improves ensemble training while staying memory-efficient: each model uses only the immediately preceding model in the chain as its teacher (see the loss sketch below).
- Generalization is theorized to relate to compression; accordingly, unusually high weight decay and dropout are used as compression pressure, to good effect (an illustrative configuration follows the list).
- Looped transformers improve inductive biases by spending more compute per prediction, iteratively refining representations with a weight-tied block (sketched after this list).
- Architectural tweaks such as Exclusive Self Attention (XSA) contribute further to the data-efficiency gains.
- Neural architecture search is seen as the key lever for data efficiency, with the author speculating that 100x improvements may be reachable within a year.
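
A minimal sketch of the ensembling idea, assuming each member is an independently trained NanoGPT-style model whose forward pass returns next-token logits. Averaging the members' probability distributions is one common aggregation choice; the post's exact method may differ.

```python
import torch
import torch.nn as nn

def ensemble_logits(models: list[nn.Module], tokens: torch.Tensor) -> torch.Tensor:
    """Aggregate independently trained members by averaging their
    next-token probability distributions, returned as log-probs."""
    with torch.no_grad():
        # Shape: (n_members, batch, seq, vocab)
        probs = torch.stack([m(tokens).softmax(dim=-1) for m in models])
    return probs.mean(dim=0).log()
```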
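For chain distillation, a standard distillation loss against a single teacher captures the memory argument: only the student and the immediately preceding model need to be resident at once. The weighting `alpha` and temperature `tau` below are illustrative assumptions, not values from the post.

```python
import torch
import torch.nn.functional as F

def chain_distill_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       targets: torch.Tensor,
                       alpha: float = 0.5, tau: float = 1.0) -> torch.Tensor:
    """Cross-entropy on data plus a KL term toward the preceding model.

    Keeping only the previous model in the chain as teacher bounds memory:
    at most two models are ever loaded during training.
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), targets.view(-1))
    kd = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / tau, dim=-1),
        F.log_softmax(teacher_logits.detach().view(-1, vocab) / tau, dim=-1),
        log_target=True, reduction="batchmean",
    ) * (tau * tau)
    return (1.0 - alpha) * ce + alpha * kd
```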
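The compression angle translates into configuration rather than architecture. A sketch of the regularization setup follows; the dropout rate and weight-decay value here are assumptions chosen to illustrate "aggressive", not the post's reported numbers.

```python
import torch
import torch.nn as nn

# Toy stand-in for a NanoGPT block; dropout rate is illustrative.
model = nn.Sequential(
    nn.Linear(256, 1024),
    nn.GELU(),
    nn.Dropout(p=0.2),  # dropout as one source of compression pressure
    nn.Linear(1024, 256),
)

# Weight decay set well above the usual ~0.1 pushes toward smaller,
# more compressible weights; the exact value is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.5)
```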
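A minimal sketch of the looped-transformer idea: a single weight-tied block applied several times per forward pass, so each prediction gets more compute without more parameters. The module name, layer choice, and loop count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Apply the same transformer block repeatedly, iteratively
    refining the representation at extra compute cost."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_loops: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True
        )
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_loops):  # weights are shared across iterations
            x = self.block(x)
        return x

# Usage: refine a batch of token embeddings with four passes of one block.
h = torch.randn(2, 16, 256)
out = LoopedBlock()(h)
```

Because the block is reused rather than stacked, loop count can be raised at inference time without retraining, which is what lets compute per prediction scale independently of parameter count.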