Scaling Laws, Carefully
13 hours ago
- #deep-learning
- #scaling-laws
- #machine-learning
- Scaling laws describe predictable power-law decreases in training loss as model size (N), dataset size (D), and compute (C) scale up.
- Early work established that generalization error scales as a power law with data and model size, with exponents varying by learning setup and domain.
- Kaplan et al. (2020) formalized scaling laws for Transformers, showing loss scales as L = (N_α / N)^α + (D_β / D)^β + E, where N_α and D_β are fitted scale constants and E is the irreducible loss, and initially suggested that model size should grow faster than data under a fixed compute budget.
- Chinchilla (Hoffmann et al., 2022) revised this, finding that optimal model size and dataset size should scale equally with compute (N_opt ∝ C^0.5, D_opt ∝ C^0.5), which indicated that many large models were undertrained.
- Discrepancies between Kaplan and Chinchilla arise from differences in model scale, parameter counting (whether embedding parameters are included), and fitting methods.
- Scaling laws for data-limited regimes incorporate repetition effects: formulations such as those of Muennighoff et al. and Lovelace et al. add penalty terms for overfitting when data is repeated.
- Fitting scaling laws is sensitive to choices like loss precision, noise, fitting region, and architectural consistency, affecting extrapolation accuracy.
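
To make the compute-optimal claim concrete, here is a minimal sketch that evaluates the additive parametric loss from the summary above and searches for the model size that minimizes it at a fixed compute budget. Everything numeric is an assumption chosen for illustration, not a fitted value from any paper: the constants `n_scale`, `d_scale`, `alpha`, `beta`, and `irreducible` are hypothetical, and the budget constraint C ≈ 6·N·D is the common rule-of-thumb approximation for Transformer training FLOPs rather than something stated in the summary. Under this form the optimum follows N_opt ∝ C^(β/(α+β)), so equal exponents give the C^0.5 scaling.

```python
import math


def parametric_loss(n_params, n_tokens, *, n_scale, d_scale, alpha, beta, irreducible):
    """Additive power-law loss: (n_scale/N)^alpha + (d_scale/D)^beta + E."""
    return (n_scale / n_params) ** alpha + (d_scale / n_tokens) ** beta + irreducible


def compute_optimal_split(compute, *, flops_per_param_token=6.0, **law):
    """Grid-search N on a log scale; D is set by the assumed budget C = 6 * N * D.

    Returns (n_params, n_tokens, loss) for the lowest-loss grid point.
    """
    best = None
    steps = 2000
    for i in range(steps + 1):
        log_n = 5.0 + i * (13.0 - 5.0) / steps  # N ranges from 1e5 to 1e13
        n = 10.0 ** log_n
        d = compute / (flops_per_param_token * n)
        loss = parametric_loss(n, d, **law)
        if best is None or loss < best[2]:
            best = (n, d, loss)
    return best


if __name__ == "__main__":
    # Hypothetical constants, for illustration only.
    law = dict(n_scale=1e13, d_scale=1e13, alpha=0.3, beta=0.3, irreducible=1.7)
    n_lo, _, _ = compute_optimal_split(1e20, **law)
    n_hi, _, _ = compute_optimal_split(1e22, **law)
    # Empirical exponent a in N_opt ∝ C^a, measured across two decades of compute.
    exponent = math.log10(n_hi / n_lo) / math.log10(1e22 / 1e20)
    print(f"estimated exponent: {exponent:.3f}")
```

Setting `alpha` larger than `beta` in this sketch pushes the exponent below 0.5, so the optimum shifts toward more data; the reverse pushes it toward larger models. This is one way to see how small differences in fitted exponents can produce different allocation recommendations.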