Hasty Briefs

The 1B Token Challenge: Finding the Perfect Pre-Training Mix

7 days ago
  • #language-models
  • #machine-learning
  • #data-efficiency
  • Achieved 90%+ of GPT-2's performance with only 1/10th of the training data (1B tokens vs. 10B).
  • Optimal dataset composition: 50% finePDFs, 30% DCLM-baseline, 20% FineWeb-Edu (see the mixing sketch after this list).
  • Static mixing outperformed curriculum learning: it avoided catastrophic failures and was faster to train.
  • Key insights: a validation-generalization tradeoff, catastrophic failures from hard cutoffs, and the importance of data diversity.
  • Trained GPT-2-70M, a 70M-parameter model, achieving performance comparable to the original GPT-2 (a hypothetical config sketch follows below).
  • Benchmarks showed near-identical performance with roughly 50x lower training cost.
  • Released the pre-training dataset collection and the trained model for community use.
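
For anyone who wants to reproduce the mix, the sketch below shows one way to interleave the three sources at the reported 50/30/20 ratio using the Hugging Face `datasets` library. The repository ids, config names, and `text` field are assumptions rather than details from the post; substitute the exact subsets it used.

```python
# Minimal sketch of the static 50/30/20 mix with Hugging Face `datasets`.
# Repository ids and config names are assumptions, not taken from the post.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("HuggingFaceFW/finepdfs", "eng_Latn",
                        split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0",
                    split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT",
                           split="train", streaming=True)

# Static mixing: every example is drawn from the same fixed source
# distribution for the entire run, unlike a curriculum that shifts
# proportions over time.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixed.take(3):
    print(example["text"][:100])
```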
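
The post's 70M-parameter architecture is not spelled out here, so the following `transformers` config is only a hypothetical sketch: the hidden size, depth, and head count are assumptions chosen so the parameter count lands near 70M.

```python
# Hypothetical ~70M-parameter GPT-2-style config; the real GPT-2-70M
# hyperparameters may differ. Values below are assumptions.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,  # standard GPT-2 BPE vocabulary
    n_positions=1024,  # context length
    n_embd=576,        # hidden size (assumption)
    n_layer=10,        # number of transformer blocks (assumption)
    n_head=9,          # attention heads; must divide n_embd (576 / 9 = 64)
)

model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")  # ~70M with tied embeddings
```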