NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute
- #Data Efficiency
- #NanoGPT
- #Ensemble Learning
- NanoGPT Slowrun reports a roughly 10x gain in data efficiency, exceeding what current scaling laws predict for the same token budget.
- Ensembling is understudied in pretraining: training multiple models independently and aggregating their predictions improves generalization (a minimal aggregation sketch follows this list).
- Training dynamics differ for ensembles: pushing each member past its individually optimal stopping point can still improve the aggregated predictions.
- Chain distillation improves ensemble training while staying memory-efficient: each model uses only the immediately preceding model in the chain as its teacher (see the loss sketch below).
- Generalization is theorized to relate to compression; accordingly, unusually high weight decay and dropout are used as compression pressure, to good effect (an illustrative configuration follows the list).
- Looped transformers improve inductive biases by spending more compute per prediction, iteratively refining representations with a weight-tied block (sketched after this list).
- Architectural tweaks such as Exclusive Self Attention (XSA) contribute further to the data-efficiency gains.
- Neural architecture search is seen as the key lever for data efficiency, with the author speculating that 100x improvements may be reachable within a year.
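
A minimal sketch of the ensembling idea, assuming each member is an independently trained NanoGPT-style model whose forward pass returns next-token logits. Averaging the members' probability distributions is one common aggregation choice; the post's exact method may differ.

```python
import torch
import torch.nn as nn

def ensemble_logits(models: list[nn.Module], tokens: torch.Tensor) -> torch.Tensor:
    """Aggregate independently trained members by averaging their
    next-token probability distributions, returned as log-probs."""
    with torch.no_grad():
        # Shape: (n_members, batch, seq, vocab)
        probs = torch.stack([m(tokens).softmax(dim=-1) for m in models])
    return probs.mean(dim=0).log()
```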
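For chain distillation, a standard distillation loss against a single teacher captures the memory argument: only the student and the immediately preceding model need to be resident at once. The weighting `alpha` and temperature `tau` below are illustrative assumptions, not values from the post.

```python
import torch
import torch.nn.functional as F

def chain_distill_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       targets: torch.Tensor,
                       alpha: float = 0.5, tau: float = 1.0) -> torch.Tensor:
    """Cross-entropy on data plus a KL term toward the preceding model.

    Keeping only the previous model in the chain as teacher bounds memory:
    at most two models are ever loaded during training.
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), targets.view(-1))
    kd = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / tau, dim=-1),
        F.log_softmax(teacher_logits.detach().view(-1, vocab) / tau, dim=-1),
        log_target=True, reduction="batchmean",
    ) * (tau * tau)
    return (1.0 - alpha) * ce + alpha * kd
```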
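The compression angle translates into configuration rather than architecture. A sketch of the regularization setup follows; the dropout rate and weight-decay value here are assumptions chosen to illustrate "aggressive", not the post's reported numbers.

```python
import torch
import torch.nn as nn

# Toy stand-in for a NanoGPT block; dropout rate is illustrative.
model = nn.Sequential(
    nn.Linear(256, 1024),
    nn.GELU(),
    nn.Dropout(p=0.2),  # dropout as one source of compression pressure
    nn.Linear(1024, 256),
)

# Weight decay set well above the usual ~0.1 pushes toward smaller,
# more compressible weights; the exact value is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.5)
```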
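A minimal sketch of the looped-transformer idea: a single weight-tied block applied several times per forward pass, so each prediction gets more compute without more parameters. The module name, layer choice, and loop count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Apply the same transformer block repeatedly, iteratively
    refining the representation at extra compute cost."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_loops: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True
        )
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_loops):  # weights are shared across iterations
            x = self.block(x)
        return x

# Usage: refine a batch of token embeddings with four passes of one block.
h = torch.randn(2, 16, 256)
out = LoopedBlock()(h)
```

Because the block is reused rather than stacked, loop count can be raised at inference time without retraining, which is what lets compute per prediction scale independently of parameter count.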