Scaling Laws, Honestly
16 hours ago
- #Scaling Laws
- #AI Research
- #Language Models
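- The original Kaplan et al. scaling laws were incorrect due to a bug in the experimental setup: the amount of training data was held fixed and the learning rate followed a cosine decay schedule, so the schedule was not matched to each run's token budget.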
- The Chinchilla scaling laws corrected this by showing that models should be trained on more data relative to their size, which favors smaller, more efficient models: Chinchilla (70B parameters) outperforms the much larger GPT-3 (175B parameters).
- Scaling laws are language-contingent: Chinchilla's optimal data-to-parameter ratio is specific to English, and morphologically richer languages need fewer tokens per parameter to train efficiently.
- Researchers outside the big labs should focus on language-dependent scaling effects: controlled experiments (e.g., with French) are affordable and reveal significant differences.
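
To make the data-to-parameter ratio in the points above concrete, here is a minimal Python sketch of how a fixed compute budget splits into parameters and tokens. It rests on two assumptions that are not stated in this post: the common approximation that training compute is about 6 * N * D FLOPs (N parameters, D tokens), and the widely quoted Chinchilla rule of thumb of roughly 20 tokens per parameter for English. The function name `compute_optimal` and the alternative ratio of 10 are purely hypothetical illustrations, not measured values for any language.

```python
import math


def compute_optimal(flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens).

    Assumes C ~= 6 * N * D and a fixed ratio D = r * N, which gives
    N = sqrt(C / (6 * r)). The default r = 20 is the commonly cited
    Chinchilla rule of thumb, not a value taken from this post.
    """
    if flops <= 0 or tokens_per_param <= 0:
        raise ValueError("flops and tokens_per_param must be positive")
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Budget roughly matching a 70B-parameter model trained on 1.4T tokens.
    budget = 6 * 70e9 * 1.4e12
    # 20 is the English rule of thumb; 10 is a made-up ratio used only to
    # show how a lower tokens-per-parameter optimum would shift the split.
    for ratio in (20.0, 10.0):
        n, d = compute_optimal(budget, ratio)
        print(f"ratio={ratio:>4}: params={n:.3e}, tokens={d:.3e}")
```

Under these assumptions, halving the ratio at the same budget moves the optimum toward a larger model trained on fewer tokens, which is the kind of shift a language-dependent ratio would imply.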