Scaling Laws, Carefully

13 hours ago
  • #deep-learning
  • #scaling-laws
  • #machine-learning
  • Scaling laws describe predictable power-law decreases in training loss as model size (N), dataset size (D), and compute (C) scale up.
  • Early work established that generalization error scales as a power law with data and model size, with exponents varying by learning setup and domain.
  • Kaplan et al. (2020) formalized scaling laws for Transformers, showing loss scales as L = (N_α / N)^α + (D_β / D)^β + E, where N_α and D_β are fitted constants and E is the irreducible loss, and initially suggested model size should grow faster than data under compute constraints.
  • Chinchilla (Hoffmann et al., 2022) revised this, finding that compute-optimal model size and data should scale in equal proportion (N_opt ∝ C^0.5, D_opt ∝ C^0.5), which indicated that many large models of the time were undertrained.
  • Discrepancies between Kaplan and Chinchilla arise from differences in model scale, parameter counting (whether embedding parameters are included), and fitting methods.
  • In data-limited regimes, scaling laws must account for repetition: formulations such as those of Muennighoff et al. and Lovelace et al. add penalty terms for overfitting when data is repeated.
  • Fitting scaling laws is sensitive to choices like loss precision, noise, fitting region, and architectural consistency, affecting extrapolation accuracy.
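
The two quantitative points in the summary, the additive power-law loss and the equal-exponent compute allocation, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the fitting procedure of either paper. The function names, the approximation C ≈ 6·N·D for training FLOPs, and the default of 20 tokens per parameter are assumptions that do not appear in the summary above; `n_scale` and `d_scale` stand in for the fitted constants in the numerators of the loss formula, and every numeric value passed in is a placeholder rather than a published fit.

```python
import math


def parametric_loss(n_params, n_tokens, n_scale, d_scale, alpha, beta, irreducible):
    """Additive power-law loss: (n_scale / N)^alpha + (d_scale / D)^beta + E.

    All constants are caller-supplied placeholders, not fitted values.
    """
    return (n_scale / n_params) ** alpha + (d_scale / n_tokens) ** beta + irreducible


def equal_scaling_allocation(compute_flops, tokens_per_param=20.0):
    """Split a compute budget C into (N, D) with N and D both proportional to C^0.5.

    Assumes C ~= 6 * N * D and a fixed ratio D = tokens_per_param * N.
    Both are illustrative assumptions, not values taken from the summary.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Each 100x increase in compute multiplies both N and D by 10.
    for c in (1e21, 1e23):
        n, d = equal_scaling_allocation(c)
        print(f"C={c:.0e} FLOPs -> N={n:.3e} params, D={d:.3e} tokens")
```

Fixing the tokens-per-parameter ratio is what makes both exponents equal to 0.5 here. Under a Kaplan-style allocation the ratio would itself change with C, so the model-size exponent would be larger than the data exponent.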