Hasty Briefs

NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

5 hours ago
  • #machine-learning
  • #data-efficiency
  • #optimization
  • NanoGPT Slowrun is an open effort to improve data-efficient learning algorithms, achieving 5.5x data efficiency in the first week.
  • Current scaling laws require proportional increases in both data and compute, but data is becoming the bottleneck in fields like robotics and biology.
  • Q Labs aims to solve generalization by developing learning algorithms that work with limited data and practically infinite compute.
  • NanoGPT Slowrun trains on a fixed 100M-token subset of FineWeb with no limit on compute; the goal is to reach the lowest possible validation loss.
  • The Muon optimizer outperforms alternatives such as AdamW, SOAP, and MAGMA; multi-epoch training and aggressive regularization are also key factors.
  • Community contributions have improved data efficiency from 2.4x to 5.5x, with techniques like epoch shuffling, learned projections, and SwiGLU activation.
  • Potential future directions include second-order optimizers, diffusion models, curriculum learning, and gradient descent alternatives.
  • 10x data efficiency seems achievable soon, with 100x being a feasible long-term goal.
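The summary credits part of the gain to the SwiGLU activation in the feed-forward block. As a minimal sketch of the standard SwiGLU formulation (SiLU-gated linear unit) in NumPy; the weight names are illustrative, not taken from the project's code:

```python
import numpy as np

def swiglu(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward block: SiLU(x W_gate) gates (x W_up),
    then projects back down with W_down."""
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU (swish) activation
    return (silu * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))           # (batch, d_model)
W_gate = rng.standard_normal((8, 32))     # d_model -> d_ff
W_up = rng.standard_normal((8, 32))
W_down = rng.standard_normal((32, 8))     # d_ff -> d_model
y = swiglu(x, W_gate, W_up, W_down)
print(y.shape)  # (4, 8)
```

Compared with a plain ReLU MLP, the gating path adds a third weight matrix, which is why SwiGLU layers are usually sized with a smaller `d_ff` to keep parameter count constant.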
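Muon, the optimizer the summary highlights, replaces the raw momentum update with an approximately orthogonalized one. A hedged sketch of the core idea, using the quintic Newton-Schulz coefficients from the publicly released Muon reference implementation (this is not the slowrun's exact training code, and the step hyperparameters here are placeholders):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix with a quintic
    Newton-Schulz iteration, as used in the Muon optimizer."""
    a, b, c = 3.4445, -4.7750, 2.0315     # coefficients from the Muon reference code
    X = G / (np.linalg.norm(G) + 1e-7)    # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # work with the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, G, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update: accumulate momentum,
    then step along its orthogonalized direction."""
    momentum = beta * momentum + G
    W = W - lr * newton_schulz_orthogonalize(momentum)
    return W, momentum

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))         # a weight matrix
G = rng.standard_normal((16, 32))         # its gradient
m = np.zeros_like(W)
W, m = muon_step(W, G, m)
print(W.shape)  # (16, 32)
```

Orthogonalizing the update equalizes its singular values, which is one explanation offered for why Muon tolerates aggressive learning rates better than AdamW.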