Scaling Laws, Carefully

13 hours ago

Scaling laws describe predictable power-law decreases in training loss as model size (N), dataset size (D), and compute (C) scale up.
Early work established that generalization error scales as a power law with data and model size, with exponents varying by learning setup and domain.
Kaplan et al. (2020) formalized scaling laws for Transformers, modeling loss as L = (N_α / N)^α + (D_β / D)^β + E, where N_α and D_β are fitted scale constants and E is the irreducible loss, and initially suggested that model size should grow faster than data under a fixed compute budget.
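As a rough illustration, the additive loss form above can be evaluated directly. This is a minimal sketch: the function name, argument names, and every constant below are hypothetical placeholders chosen for readability, not fitted values from any paper.

```python
def additive_loss(n_params, n_tokens, n_scale, d_scale, alpha, beta, irreducible):
    """Evaluate L = (n_scale / N)^alpha + (d_scale / D)^beta + E."""
    model_term = (n_scale / n_params) ** alpha  # shrinks as the model grows
    data_term = (d_scale / n_tokens) ** beta    # shrinks as the dataset grows
    return model_term + data_term + irreducible


# Illustrative placeholder constants, NOT fitted values from any paper.
PARAMS = dict(n_scale=1e7, d_scale=1e9, alpha=0.3, beta=0.3, irreducible=1.7)

small = additive_loss(1e8, 1e10, **PARAMS)  # smaller model, less data
large = additive_loss(1e9, 1e11, **PARAMS)  # 10x the parameters and tokens
```

Under this form the loss falls as either N or D grows, but it never drops below the irreducible term E, which is why the two reducible terms are usually discussed separately.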
Chinchilla (Hoffmann et al., 2022) revised this, finding that compute-optimal model size and training data should scale in roughly equal proportion (N_opt ∝ C^0.5, D_opt ∝ C^0.5), which implies that many large models of the time were undertrained.
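To make the equal-scaling result concrete, here is a small sketch of how a compute budget could be split between parameters and tokens. It rests on two assumptions that are not stated in the text above: the common approximation C ≈ 6·N·D for training FLOPs, and a fixed tokens-per-parameter ratio (20 is a frequently quoted rule of thumb, used here only as a default). The function name is hypothetical.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C ~= 6 * N * D and a constant ratio D / N = tokens_per_param,
    so that both N and D grow as C^0.5.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example budgets (arbitrary illustrative values).
n_base, d_base = compute_optimal_split(1.2e20)
n_big, d_big = compute_optimal_split(1.2e22)  # 100x more compute
```

With a fixed ratio, a 100x increase in compute raises both the parameter count and the token count by about 10x, which is what the C^0.5 exponents express.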
Discrepancies between Kaplan and Chinchilla arise from differences in model scale, parameter counting (including embeddings), and fitting methods.
Scaling laws for data-limited regimes incorporate repetition effects: formulations such as those of Muennighoff et al. and Lovelace et al. add penalty terms for overfitting when data is repeated.
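One way to picture a repetition penalty is to replace the raw token count with an "effective" count in which each additional pass over the same data is worth less than the one before. The sketch below uses a saturating exponential in the spirit of Muennighoff et al.; the exact parameterization, the function name, and the decay constant are assumptions for illustration and should be checked against the original papers.

```python
import math


def effective_tokens(unique_tokens, total_tokens, decay=15.0):
    """Discount repeated tokens with an exponentially saturating value.

    `decay` is a hypothetical constant controlling how quickly repeated
    epochs lose value; it is not a fitted value.
    """
    if total_tokens <= unique_tokens:
        return total_tokens  # no repetition yet, every token counts fully
    repeats = total_tokens / unique_tokens - 1.0  # extra passes over the data
    return unique_tokens * (1.0 + decay * (1.0 - math.exp(-repeats / decay)))


one_epoch = effective_tokens(1e9, 1e9)
five_epochs = effective_tokens(1e9, 5e9)
many_epochs = effective_tokens(1e9, 1e12)
```

In this sketch a few epochs are worth almost as much as fresh data, while the effective count can never exceed (1 + decay) times the unique tokens, no matter how long training continues.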
Fitting scaling laws is sensitive to choices like loss precision, noise, fitting region, and architectural consistency, affecting extrapolation accuracy.
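The sensitivity to fitting choices can be seen even on noise-free data. The sketch below fits L − E = A·N^(−α) by least squares in log-log space, once with the correct irreducible loss E and once assuming E = 0. All data are synthetic, generated from constants picked for illustration; the helper name is hypothetical, and this is not the fitting procedure of any cited paper.

```python
import math


def fit_power_law(sizes, losses, irreducible=0.0):
    """Least-squares line in log-log space for L - E = A * N^(-alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss - irreducible) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (A, alpha)


# Synthetic, noise-free data from known constants (not real measurements).
TRUE_A, TRUE_ALPHA, TRUE_E = 50.0, 0.3, 1.7
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** -TRUE_ALPHA + TRUE_E for n in sizes]

a_good, alpha_good = fit_power_law(sizes, losses, irreducible=TRUE_E)
a_bad, alpha_bad = fit_power_law(sizes, losses, irreducible=0.0)
```

With the correct E the fit recovers the generating exponent, while ignoring E yields a much shallower exponent from the same points. Extrapolations from the two fits diverge quickly, which is one reason the handling of the irreducible term and the choice of fitting region matter.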
