Scaling Laws, Carefully

a month ago

Scaling laws describe power-law relationships between training loss and model size (N), dataset size (D), and compute (C); they are used to decide how to split a fixed compute budget between model size and data in deep learning.
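As a rough illustration of what "power-law relationship" means in practice, the sketch below fits loss = a * N^(-b) by ordinary least squares on log-transformed values. The data is synthetic, and the names (fit_power_law, model_sizes, losses) and the constants 5.0 and 0.1 are assumptions made for the example, not values from any of the cited papers.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b), done as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope

# Synthetic, noise-free losses generated from a known power law (illustrative constants).
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in model_sizes]

a_hat, b_hat = fit_power_law(model_sizes, losses)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

On noise-free data the fit recovers the generating constants; real loss curves are noisy and usually include an irreducible-loss offset, which this simple form ignores.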
Early work (e.g., Amari et al. 1992; Hestness et al. 2017) established that learning curves follow power laws, and that architecture shifts the offset of the curve but not its exponent, which depends on the domain.
Kaplan et al. (2020) formalized scaling laws for Transformers, suggesting model size should grow faster than data (N_opt ∝ C^0.73), but this was later contested.
Chinchilla (Hoffmann et al. 2022) found optimal scaling with N_opt ∝ C^0.5, advocating for smaller models trained on more tokens, showing many large models were undertrained.
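To make the difference between the two exponents concrete, here is a small sketch of how model size and token count would grow when the compute budget grows. It assumes compute is proportional to N * D (the common C ≈ 6ND approximation, which is not stated in the summary above), so that D_opt ∝ C^(1 - exponent). The function name growth_factors is hypothetical.

```python
def growth_factors(compute_multiplier, n_exponent):
    """Return (model-size growth, token-count growth) for a given compute increase.

    Assumes N_opt is proportional to C**n_exponent and that C is proportional
    to N * D, which implies D_opt is proportional to C**(1 - n_exponent).
    """
    n_growth = compute_multiplier ** n_exponent
    d_growth = compute_multiplier ** (1.0 - n_exponent)
    return n_growth, d_growth

# Exponents as quoted above: 0.73 (Kaplan et al.) and 0.5 (Chinchilla).
kaplan = growth_factors(10.0, 0.73)
chinchilla = growth_factors(10.0, 0.5)

print(f"Kaplan:     N x{kaplan[0]:.2f}, D x{kaplan[1]:.2f}")
print(f"Chinchilla: N x{chinchilla[0]:.2f}, D x{chinchilla[1]:.2f}")
```

Under these assumptions, a 10x compute increase goes mostly into parameters with the 0.73 exponent (roughly 5.4x larger model, 1.9x more tokens), while the 0.5 exponent splits it evenly (roughly 3.2x each).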
Discrepancies between Kaplan and Chinchilla arise from differences in scale, embedding parameter accounting, and fitting methods, later reconciled by Pearce & Song (2024).
Power laws may stem from data manifold dimensionality or quantized skill learning, but theoretical explanations remain incomplete.
In data-limited regimes, studies (e.g., Hernandez et al. 2022; Muennighoff et al. 2023) show that repeating data reduces training efficiency; scaling laws are adjusted for this with an effective-data term and overfitting penalties.
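One way to picture an "effective data" term is a saturating function of the number of repeats: each additional pass over the same tokens counts for less than the one before. The sketch below uses an exponential-decay form along the lines of the one proposed by Muennighoff et al. (2023). The function name and the decay constant r_star = 15.0 are illustrative assumptions, not fitted values from any paper.

```python
import math

def effective_tokens(total_tokens, unique_tokens, r_star=15.0):
    """Effective token count when unique_tokens are repeated to reach total_tokens.

    repeats counts the passes beyond the first epoch. The value of each extra
    pass decays exponentially with rate 1 / r_star (hypothetical constant).
    """
    repeats = max(total_tokens / unique_tokens - 1.0, 0.0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

one_epoch = effective_tokens(100e9, 100e9)    # no repetition
four_epochs = effective_tokens(400e9, 100e9)  # three repeats of the same data

print(f"1 epoch:  {one_epoch / 1e9:.1f}B effective tokens")
print(f"4 epochs: {four_epochs / 1e9:.1f}B effective tokens")
```

With no repetition the effective count equals the unique count. With repetition it always falls short of the raw token count, and it can never exceed unique_tokens * (1 + r_star) however many epochs are run.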
Lovelace et al. (2026) introduced an overfitting penalty term based on capacity ratio (N/U_D) and repetition, improving scaling law fits for constrained data.
Fitting scaling laws is tricky: the fits are sensitive to loss precision, noise, and the choice of fitting region, and they rest on assumptions such as a fixed architecture and consistent hyperparameter tuning, as replication attempts have shown (Besiroglu et al. 2024).
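The sensitivity to fitting region can be shown with a small synthetic experiment. The sketch below generates losses from L(N) = E + A * N^(-alpha), with an irreducible-loss offset E, and then fits a pure power law (no offset) over two different ranges of N. The constants are illustrative, and the misspecified fit is a deliberate simplification chosen to expose the effect, not a method taken from the cited papers.

```python
import math

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    sxx = sum((u - mean_x) ** 2 for u in lx)
    return sxy / sxx

# Illustrative constants for a synthetic loss curve with an irreducible term E.
E, A, ALPHA = 1.7, 400.0, 0.34

def loss(n):
    return E + A * n ** -ALPHA

small_region = [10 ** (6 + 0.5 * i) for i in range(5)]  # 1e6 to 1e8 parameters
large_region = [10 ** (8 + 0.5 * i) for i in range(5)]  # 1e8 to 1e10 parameters

# Apparent exponent when the offset E is ignored, fitted on each region separately.
exp_small = -loglog_slope(small_region, [loss(n) for n in small_region])
exp_large = -loglog_slope(large_region, [loss(n) for n in large_region])

print(f"true exponent: {ALPHA}")
print(f"fit on 1e6..1e8:  {exp_small:.3f}")
print(f"fit on 1e8..1e10: {exp_large:.3f}")
```

Both fits underestimate the true exponent, and the one on larger models does so more severely, because the offset E makes up a growing share of the loss as N increases. The same data therefore yields different exponents depending on where the fit is made.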
