Scaling Laws, Carefully
6 days ago
- #scaling laws
- #deep learning
- #compute optimization
- Scaling laws describe power-law relationships between training loss and model size (N), dataset size (D), and compute (C), essential for optimal compute allocation in deep learning.
- Early work (e.g., Amari et al. 1992, Hestness et al. 2017) established power-law learning curves and found that architecture shifts the offset of the curve but not its exponent; the exponent is domain-specific.
- Kaplan et al. (2020) formalized scaling laws for Transformers, suggesting model size should grow faster than data (N_opt ∝ C^0.73), but this was later contested.
- Chinchilla (Hoffmann et al. 2022) found optimal scaling with N_opt ∝ C^0.5, advocating for smaller models trained on more tokens, showing many large models were undertrained.
- Discrepancies between Kaplan and Chinchilla arise from differences in scale, embedding parameter accounting, and fitting methods; Pearce & Song (2024) later reconciled the two results.
- Power laws may stem from data manifold dimensionality or quantized skill learning, but theoretical explanations remain incomplete.
- In data-limited regimes, studies (e.g., Hernandez et al. 2022, Muennighoff et al. 2023) show that repeating data reduces training efficiency; scaling laws for this regime are adjusted with effective-data terms and overfitting penalties.
- Lovelace et al. (2026) introduced an overfitting penalty term based on capacity ratio (N/U_D) and repetition, improving scaling law fits for constrained data.
- Fitting scaling laws is tricky due to sensitivity to loss precision, noise, fitting region, and assumptions like fixed architecture and tuning, as seen in replication attempts (Besiroglu et al. 2024).
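
To make the gap between the Kaplan and Chinchilla exponents concrete, here is a minimal sketch of what each implies when the compute budget grows tenfold. It assumes, beyond what the points above state, that compute scales as the product of model size and data (C ∝ N · D, the common approximation), so that N_opt ∝ C^a implies D_opt ∝ C^(1 - a). The function name `scale_factors` is hypothetical and chosen only for illustration.

```python
def scale_factors(compute_multiplier: float, n_exponent: float) -> tuple[float, float]:
    """Return (model-size growth, data growth) for a given increase in compute.

    Assumption: N_opt ∝ C^a and C ∝ N * D, hence D_opt ∝ C^(1 - a).
    """
    n_growth = compute_multiplier ** n_exponent
    d_growth = compute_multiplier ** (1.0 - n_exponent)
    return n_growth, d_growth


# Exponents as quoted above: 0.73 (Kaplan et al. 2020), 0.5 (Hoffmann et al. 2022).
for name, a in [("Kaplan", 0.73), ("Chinchilla", 0.5)]:
    n_growth, d_growth = scale_factors(10.0, a)
    print(f"{name}: N x{n_growth:.2f}, D x{d_growth:.2f}")

# Prints:
# Kaplan: N x5.37, D x1.86
# Chinchilla: N x3.16, D x3.16
```

Under these assumptions, a 10x compute increase goes mostly into parameters with the Kaplan exponent, while the Chinchilla exponent splits it evenly between parameters and tokens. That difference is why models sized by the earlier rule look undertrained by the later one.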