The 1B Token Challenge: Finding the Perfect Pre-Training Mix
- #language-models
- #machine-learning
- #data-efficiency
- Achieved 90%+ of GPT-2's performance with only 1/10th of the training data (1B tokens vs. 10B tokens).
- Best-performing dataset mix: 50% FinePDFs, 30% DCLM-baseline, 20% FineWeb-Edu (see the loading sketch after this list).
- Static mixing outperformed curriculum learning: it avoided catastrophic failures and was faster to train (a scheduling sketch also follows the list).
- Key insights: the validation-generalization tradeoff, hard-cutoff catastrophes, and the importance of data diversity.
- Trained a 70M-parameter model, GPT-2-70M, that reaches performance comparable to the original GPT-2.
- Benchmarks showed near-identical performance to the original GPT-2 at roughly 50x lower training cost.
- Released the pre-training dataset collection and the trained model for community use.
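
As a rough sketch of how a 50/30/20 mix like the one above could be assembled, the snippet below interleaves the three sources with fixed sampling probabilities using the Hugging Face `datasets` library. The repository IDs, the streaming setup, and the whitespace token count are assumptions for illustration, not the post's actual pipeline.

```python
# Minimal sketch: build a 50/30/20 pre-training mix with the Hugging Face
# `datasets` library. The repo IDs below are assumptions for illustration.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Static mix: one fixed probability vector for the whole run.
mix = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
)

# Draw documents until roughly 1B tokens have been consumed
# (whitespace splitting is a crude stand-in for real tokenization).
token_budget = 1_000_000_000
seen = 0
for example in mix:
    seen += len(example["text"].split())
    if seen >= token_budget:
        break
```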
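
For a purely schematic contrast between the two strategies compared in the post: static mixing keeps one probability vector for the entire run, while a curriculum (here, a hard-cutoff variant of the kind reported as failing) changes the source distribution abruptly at phase boundaries. The schedule shapes and function names below are made up for illustration.

```python
# Schematic contrast: static mixing vs. a hard-cutoff curriculum schedule
# for per-source sampling probabilities. Values are illustrative only.
import random
from typing import List

SOURCES = ["finepdfs", "dclm-baseline", "fineweb-edu"]

def static_mix(step: int, total_steps: int) -> List[float]:
    """Static mixing: the same 50/30/20 split at every training step."""
    return [0.5, 0.3, 0.2]

def hard_cutoff_curriculum(step: int, total_steps: int) -> List[float]:
    """Hard-cutoff curriculum: switch sources abruptly at phase boundaries.
    Abrupt shifts like these are the kind of schedule reported to cause
    catastrophic failures."""
    phase = step / total_steps
    if phase < 0.5:
        return [1.0, 0.0, 0.0]  # phase 1: FinePDFs only
    elif phase < 0.8:
        return [0.0, 1.0, 0.0]  # phase 2: DCLM-baseline only
    return [0.0, 0.0, 1.0]      # phase 3: FineWeb-Edu only

def pick_source(step: int, total_steps: int, schedule=static_mix) -> str:
    """A data loader would call this per document/batch to choose a source."""
    probs = schedule(step, total_steps)
    return random.choices(SOURCES, weights=probs, k=1)[0]

if __name__ == "__main__":
    total_steps = 10_000
    for step in (0, 4_999, 9_999):
        print(step, static_mix(step, total_steps),
              hard_cutoff_curriculum(step, total_steps))
```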