The 1B Token Challenge: Finding the Perfect Pre-Training Mix
- #language-models
- #machine-learning
- #data-efficiency
- Achieved 90%+ of GPT-2's performance with only 1/10th of the training data (1B tokens vs. 10B tokens).
- Best-performing dataset mix: 50% FinePDFs, 30% DCLM-baseline, 20% FineWeb-Edu (see the loading sketch after this list).
- Static mixing outperformed curriculum learning: it avoided catastrophic failures and was faster to train (a scheduling sketch also follows the list).
- Key insights: the validation-generalization tradeoff, hard-cutoff catastrophes, and the importance of data diversity.
- Trained a 70M-parameter model, GPT-2-70M, that reaches performance comparable to the original GPT-2.
- Benchmarks showed near-identical performance to the original GPT-2 at roughly 50x lower training cost.
- Released the pre-training dataset collection and the trained model for community use.
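
As a rough sketch of how a 50/30/20 mix like the one above could be assembled, the snippet below interleaves the three sources with fixed sampling probabilities using the Hugging Face `datasets` library. The repository IDs, the streaming setup, and the whitespace token count are assumptions for illustration, not the post's actual pipeline.

```python
# Minimal sketch: build a 50/30/20 pre-training mix with the Hugging Face
# `datasets` library. The repo IDs below are assumptions for illustration.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Static mix: one fixed probability vector for the whole run.
mix = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
)

# Draw documents until roughly 1B tokens have been consumed
# (whitespace splitting is a crude stand-in for real tokenization).
token_budget = 1_000_000_000
seen = 0
for example in mix:
    seen += len(example["text"].split())
    if seen >= token_budget:
        break
```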
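
For a purely schematic contrast between the two strategies compared in the post: static mixing keeps one probability vector for the entire run, while a curriculum (here, a hard-cutoff variant of the kind reported as failing) changes the source distribution abruptly at phase boundaries. The schedule shapes and function names below are made up for illustration.

```python
# Schematic contrast: static mixing vs. a hard-cutoff curriculum schedule
# for per-source sampling probabilities. Values are illustrative only.
import random
from typing import List

SOURCES = ["finepdfs", "dclm-baseline", "fineweb-edu"]

def static_mix(step: int, total_steps: int) -> List[float]:
    """Static mixing: the same 50/30/20 split at every training step."""
    return [0.5, 0.3, 0.2]

def hard_cutoff_curriculum(step: int, total_steps: int) -> List[float]:
    """Hard-cutoff curriculum: switch sources abruptly at phase boundaries.
    Abrupt shifts like these are the kind of schedule reported to cause
    catastrophic failures."""
    phase = step / total_steps
    if phase < 0.5:
        return [1.0, 0.0, 0.0]  # phase 1: FinePDFs only
    elif phase < 0.8:
        return [0.0, 1.0, 0.0]  # phase 2: DCLM-baseline only
    return [0.0, 0.0, 1.0]      # phase 3: FineWeb-Edu only

def pick_source(step: int, total_steps: int, schedule=static_mix) -> str:
    """A data loader would call this per document/batch to choose a source."""
    probs = schedule(step, total_steps)
    return random.choices(SOURCES, weights=probs, k=1)[0]

if __name__ == "__main__":
    total_steps = 10_000
    for step in (0, 4_999, 9_999):
        print(step, static_mix(step, total_steps),
              hard_cutoff_curriculum(step, total_steps))
```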