NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
- #machine-learning
- #data-efficiency
- #optimization
- NanoGPT Slowrun is an open effort to improve data-efficient learning algorithms; in its first week it reached a 5.5x improvement in data efficiency over the baseline.
- Current scaling laws require proportional increases in both data and compute, but data is becoming the bottleneck in fields like robotics and biology.
- Q Labs aims to solve generalization by developing learning algorithms that work with limited data and practically infinite compute.
- NanoGPT Slowrun fixes the training set at 100M tokens from FineWeb while placing no limit on compute; the goal is the lowest validation loss.
- The Muon optimizer outperforms alternatives such as AdamW, SOAP, and MAGMA; multi-epoch training and aggressive regularization are also key factors.
- Community contributions have improved data efficiency from 2.4x to 5.5x via techniques such as epoch shuffling, learned projections, and the SwiGLU activation.
- Potential future directions include second-order optimizers, diffusion models, curriculum learning, and gradient descent alternatives.
- 10x data efficiency seems achievable soon, with 100x being a feasible long-term goal.
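Muon's published core idea is to orthogonalize the SGD-momentum matrix with a Newton-Schulz iteration before applying the update, so every update direction has roughly uniform singular values. A minimal NumPy sketch follows; the coefficients match the public reference implementation, but the function names, shapes, and hyperparameters here are illustrative, not taken from the Slowrun codebase:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G: push its singular values toward 1.

    Quintic iteration with coefficients from the public Muon reference
    implementation; convergence is deliberately loose."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # spectral norm <= Frobenius norm <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T                     # small Gram matrix
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, beta=0.95):
    # Momentum buffer update, then an orthogonalized update direction.
    buf = beta * buf + grad
    return W - lr * newton_schulz_orthogonalize(buf), buf

# Toy usage on a random weight matrix (shapes are illustrative).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
grad = rng.standard_normal((16, 32))
buf = np.zeros_like(W)
W, buf = muon_step(W, grad, buf)
```

The orthogonalization is what distinguishes Muon from plain SGD with momentum: it equalizes the scale of update directions across the singular subspaces of the momentum matrix, which is cheap for the 2D weight matrices that dominate a transformer.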
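For reference, the SwiGLU activation named among the community contributions replaces the standard two-layer ReLU MLP with a gated form, FFN(x) = (SiLU(x W_gate) ⊙ x W_up) W_down. A minimal NumPy sketch, with illustrative weight names and shapes rather than anything from the Slowrun codebase:

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. Swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated feed-forward: the SiLU'd gate branch multiplicatively
    # modulates the linear "up" branch before the down-projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Toy shapes: batch of 4 token vectors, model dim 8, hidden dim 32.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_gate = rng.standard_normal((8, 32))
w_up = rng.standard_normal((8, 32))
w_down = rng.standard_normal((32, 8))
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

Relative to a ReLU MLP with one up-projection, SwiGLU adds a third weight matrix; implementations typically shrink the hidden dimension to keep the parameter count comparable.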