Pre-training under infinite compute
- #pre-training
- #machine-learning
- #data-efficiency
- Pre-training under infinite compute studies how to optimize language model pre-training when training data is fixed but compute is not a constraint.
- Existing data-constrained approaches, such as increasing epoch count or parameter count, eventually overfit.
- The optimal weight decay is found to be roughly 30× larger than standard practice, improving regularization (see the optimizer sketch after this list).
- Ensembling independently trained models reaches a lower loss asymptote than the regularized recipe (see the ensembling sketch below).
- Combining epoching, regularization, parameter scaling, and ensemble scaling achieves a 5.17× data efficiency improvement.
- Distilling an ensemble into a smaller student model retains 83% of the ensembling benefit (see the distillation sketch below).
- The interventions generalize to downstream benchmarks, with a 9% improvement on pre-training evals and 17.5× better data efficiency on math tasks.
- Simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.
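A minimal PyTorch sketch of the regularization change, assuming AdamW with a baseline weight decay of 0.1 as "standard practice" and a toy stand-in model; these specific values are assumptions for illustration, not the paper's exact hyperparameters.

```python
import torch
from torch import nn

# Hypothetical toy model standing in for a transformer language model.
model = nn.Linear(1024, 1024)

standard_wd = 0.1            # assumed "standard practice" weight decay
tuned_wd = 30 * standard_wd  # ~30x larger, per the paper's tuning result

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=tuned_wd)
```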
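A sketch of ensembling at evaluation time, assuming K members trained independently (e.g., different random seeds) that share a tokenizer and return next-token logits when called; averaging in probability space is one common choice and may differ from the paper's exact procedure.

```python
import torch

def ensemble_log_probs(models, input_ids):
    """Average next-token probabilities over independently trained members.

    Assumes each member is callable as model(input_ids) and returns logits of
    shape (batch, seq, vocab) over a shared vocabulary.
    """
    probs = None
    with torch.no_grad():
        for model in models:
            p = torch.softmax(model(input_ids), dim=-1)
            probs = p if probs is None else probs + p
    probs = probs / len(models)
    # Return log-probabilities so the result plugs into a standard NLL loss.
    return torch.log(probs)
```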
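A generic knowledge-distillation loss against the ensemble teacher, as a sketch only; the paper's actual distillation objective (temperature, mixing with the standard LM loss, training data) is not specified here.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs):
    # teacher_probs: averaged next-token probabilities from the ensemble (teacher).
    # student_logits: raw logits from the smaller student model.
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # KL(teacher || student); "batchmean" matches the mathematical KL definition.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```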