Hasty Briefs

Pre-training under infinite compute

17 days ago
  • #pre-training
  • #machine-learning
  • #data-efficiency
  • Pre-training under infinite compute studies how to optimize language model pre-training when data is fixed but compute is effectively unconstrained.
  • Existing data-constrained approaches, such as increasing epoch count or parameter count, eventually overfit.
  • Optimal weight decay is found to be 30× larger than standard practice, improving regularization (see the configuration sketch after this list).
  • Ensembling independently trained models achieves a lower loss asymptote than regularized recipes (an ensembling sketch follows the list).
  • Combining epoching, regularization, parameter scaling, and ensemble scaling achieves a 5.17× data efficiency improvement (the data-efficiency ratio is illustrated below).
  • Distilling an ensemble into a smaller student model retains 83% of the ensembling benefit (a distillation sketch follows below).
  • The interventions generalize to downstream benchmarks, showing a 9% improvement on pre-training evals and 17.5× data efficiency on downstream math tasks.
  • Simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.
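A few hedged sketches of the techniques summarized above follow. First, weight decay: a minimal PyTorch sketch of what "30× larger than standard practice" could look like, assuming a common AdamW baseline weight decay of 0.1. The model, learning rate, and baseline value are illustrative assumptions, not the paper's configuration.

```python
import torch
from torch import nn

# Illustrative assumption: AdamW weight decay is often set around 0.1;
# "30x larger than standard practice" would then land near 3.0.
STANDARD_WEIGHT_DECAY = 0.1
TUNED_WEIGHT_DECAY = 30 * STANDARD_WEIGHT_DECAY  # 3.0

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # stand-in for an LM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                          # illustrative learning rate
    weight_decay=TUNED_WEIGHT_DECAY,  # the heavily regularized setting
)
```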
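Next, a sketch of evaluation-time ensembling of independently trained models (e.g. different random seeds), averaging member token probabilities to score held-out data. The brief does not say whether the paper averages probabilities or logits, so the probability-space choice, the helper name, and the assumption that `model(input_ids)` returns logits are illustrative.

```python
import torch
import torch.nn.functional as F

def ensemble_nll(models, input_ids, targets):
    """Negative log-likelihood of an ensemble that uniformly averages
    member token probabilities (one simple ensembling choice).

    models:    independently trained LMs sharing a vocabulary
    input_ids: (batch, seq) token ids fed to every member
    targets:   (batch, seq) next-token labels
    """
    probs = None
    with torch.no_grad():
        for model in models:
            logits = model(input_ids)            # assumed (batch, seq, vocab)
            p = F.softmax(logits, dim=-1)
            probs = p if probs is None else probs + p
    probs = probs / len(models)                  # uniform average over members
    log_probs = torch.log(probs.clamp_min(1e-12))
    return F.nll_loss(log_probs.flatten(0, 1), targets.flatten())
```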
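The data-efficiency multiplier can be read as: how many times more training tokens the baseline recipe would need to match the improved recipe's loss at a given token budget. Below is a toy sketch of that ratio under the assumption that each recipe is summarized by a fitted power-law loss curve loss(N) = E + A / N^alpha; the functional form and all numbers are made up for illustration, not the paper's fits.

```python
import numpy as np

def tokens_to_reach(loss_target, fit):
    """Invert an assumed power-law fit loss(N) = E + A / N**alpha."""
    E, A, alpha = fit
    if loss_target <= E:
        return np.inf  # target sits below this recipe's asymptote
    return (A / (loss_target - E)) ** (1.0 / alpha)

def data_efficiency(baseline_fit, improved_fit, token_budget):
    """Tokens the baseline needs to match the improved recipe's loss
    at token_budget, divided by token_budget."""
    E, A, alpha = improved_fit
    target_loss = E + A / token_budget ** alpha
    return tokens_to_reach(target_loss, baseline_fit) / token_budget

# Made-up fits (E, A, alpha), purely to show the calculation.
baseline = (2.5, 100.0, 0.30)
improved = (2.5, 60.0, 0.30)
print(data_efficiency(baseline, improved, token_budget=200e6))  # ~5.5x
```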
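Finally, a minimal sketch of distilling an ensemble teacher into one smaller student: the student minimizes a mix of the usual next-token cross-entropy and a KL term toward the averaged teacher distribution. The mixing weight, temperature, and probability-space averaging are assumptions for illustration, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_models, input_ids, targets,
                      alpha=0.5, temperature=1.0):
    """Hard-label cross-entropy mixed with KL toward an ensemble teacher.
    alpha and temperature are illustrative hyperparameters."""
    with torch.no_grad():
        teacher_probs = None
        for teacher in teacher_models:
            p = F.softmax(teacher(input_ids) / temperature, dim=-1)
            teacher_probs = p if teacher_probs is None else teacher_probs + p
        teacher_probs = teacher_probs / len(teacher_models)

    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(
        student_log_probs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    ) * temperature ** 2                     # standard temperature rescaling
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    return alpha * kl + (1 - alpha) * ce
```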