NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute

6 hours ago
  • #Data Efficiency
  • #NanoGPT
  • #Ensemble Learning
  • Achieved 10x data efficiency with NanoGPT Slowrun, a result that runs counter to current scaling-law expectations.
  • Ensembling is understudied in pretraining: training multiple models independently and aggregating their predictions improves generalization (see the averaging sketch after this list).
  • Training dynamics differ for ensembles; pushing individual models past their own optima still benefits the ensemble as a whole.
  • Chain distillation improves ensemble training and stays memory-efficient, since each model uses only the immediately preceding model as its teacher (loss sketched below).
  • Generalization is theorized to relate to compression; high weight decay and dropout are used effectively to that end (example knobs below).
  • Looped transformers strengthen inductive biases by spending more compute per prediction, refining representations iteratively (see the looped-block sketch below).
  • Architectural tweaks like Exclusive Self Attention (XSA) contribute to data efficiency gains.
  • Neural architecture search is crucial for data efficiency, with potential for 100x improvement within a year.
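
A minimal sketch of the prediction-aggregation idea from the ensembling bullet, assuming several independently pretrained GPT-style models that share a tokenizer. The function name and the choice to average in probability space are illustrative assumptions, not details from the Slowrun write-up.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_next_token_logprobs(models, idx):
    """Average next-token probabilities over independently trained members."""
    probs = None
    for model in models:
        logits = model(idx)                      # (batch, seq, vocab)
        p = F.softmax(logits[:, -1, :], dim=-1)  # last-position distribution
        probs = p if probs is None else probs + p
    return torch.log(probs / len(models))        # log of the mean distribution
```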
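A hedged sketch of chain distillation as described above: model k trains against the data loss plus a distillation term toward model k-1 only, so at most one frozen teacher needs to be resident in memory. The mixing weight `alpha` and temperature `T` are assumed hyperparameters, not values from the source.

```python
import torch.nn.functional as F

def chain_distill_loss(student_logits, teacher_logits, targets, alpha=0.5, T=1.0):
    """Data loss plus KL toward the single preceding model in the chain."""
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         targets.view(-1))
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kl
```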
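For the compression-as-regularization bullet, a minimal example of the two knobs mentioned: dropout inside the model and a high weight-decay coefficient in the optimizer. The concrete values are placeholders, not the Slowrun settings.

```python
import torch
import torch.nn as nn

# Placeholder layer standing in for a transformer block; dropout set explicitly.
model = nn.TransformerEncoderLayer(d_model=256, nhead=8, dropout=0.2,
                                   batch_first=True)
# High weight decay via AdamW; 0.5 is an illustrative value, not the paper's.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.5)
```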
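A rough sketch of the looped-transformer idea: one shared block is applied several times per forward pass, so each prediction gets more compute without more parameters. The layer layout, loop count, and omission of causal masking are simplifying assumptions, not the Slowrun architecture.

```python
import torch.nn as nn

class LoopedTransformer(nn.Module):
    def __init__(self, vocab_size=50304, d_model=256, n_heads=8, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_loops = n_loops
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        x = self.embed(idx)
        for _ in range(self.n_loops):   # same weights reused each iteration
            x = self.block(x)           # causal masking omitted for brevity
        return self.head(x)
```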