Hasty Briefs

NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

5 hours ago
  • #machine-learning
  • #data-efficiency
  • #optimization
  • NanoGPT Slowrun is an open effort to improve data-efficient learning algorithms, achieving 5.5x data efficiency in the first week.
  • Current scaling laws require proportional increases in both data and compute, but data is becoming the bottleneck in fields like robotics and biology.
  • Q Labs aims to solve generalization by developing learning algorithms that work with limited data and practically infinite compute.
  • NanoGPT Slowrun trains on a fixed 100M-token subset of FineWeb with no limit on compute; the goal is to reach the lowest possible validation loss.
  • The Muon optimizer outperforms alternatives such as AdamW, SOAP, and MAGMA; multi-epoch training and aggressive regularization are also key factors.
  • Community contributions have improved data efficiency from 2.4x to 5.5x, with techniques like epoch shuffling, learned projections, and SwiGLU activation.
  • Potential future directions include second-order optimizers, diffusion models, curriculum learning, and gradient descent alternatives.
  • 10x data efficiency seems achievable soon, with 100x being a feasible long-term goal.
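The summary credits part of the gain to the SwiGLU activation in the feed-forward block. As a minimal sketch of the standard SwiGLU formulation (SiLU-gated linear unit) in NumPy; the weight names are illustrative, not taken from the project's code:

```python
import numpy as np

def swiglu(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward block: SiLU(x W_gate) gates (x W_up),
    then projects back down with W_down."""
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU (swish) activation
    return (silu * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))           # (batch, d_model)
W_gate = rng.standard_normal((8, 32))     # d_model -> d_ff
W_up = rng.standard_normal((8, 32))
W_down = rng.standard_normal((32, 8))     # d_ff -> d_model
y = swiglu(x, W_gate, W_up, W_down)
print(y.shape)  # (4, 8)
```

Compared with a plain ReLU MLP, the gating path adds a third weight matrix, which is why SwiGLU layers are usually sized with a smaller `d_ff` to keep parameter count constant.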
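Muon, the optimizer the summary highlights, replaces the raw momentum update with an approximately orthogonalized one. A hedged sketch of the core idea, using the quintic Newton-Schulz coefficients from the publicly released Muon reference implementation (this is not the slowrun's exact training code, and the step hyperparameters here are placeholders):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix with a quintic
    Newton-Schulz iteration, as used in the Muon optimizer."""
    a, b, c = 3.4445, -4.7750, 2.0315     # coefficients from the Muon reference code
    X = G / (np.linalg.norm(G) + 1e-7)    # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # work with the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, G, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update: accumulate momentum,
    then step along its orthogonalized direction."""
    momentum = beta * momentum + G
    W = W - lr * newton_schulz_orthogonalize(momentum)
    return W, momentum

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))         # a weight matrix
G = rng.standard_normal((16, 32))         # its gradient
m = np.zeros_like(W)
W, m = muon_step(W, G, m)
print(W.shape)  # (16, 32)
```

Orthogonalizing the update equalizes its singular values, which is one explanation offered for why Muon tolerates aggressive learning rates better than AdamW.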