Scaling Laws, Honestly

16 hours ago

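The original Kaplan et al. scaling laws were incorrect because of a methodological bug: runs used a fixed amount of training data with a cosine-decayed learning rate schedule, so the schedule length was not matched to the training horizon being compared.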
The Chinchilla scaling laws corrected this by showing that, for a fixed compute budget, models should be trained on more data relative to their size. This favors smaller, more efficient models such as Chinchilla over larger ones such as GPT-3.
Scaling laws are language-contingent: Chinchilla's optimal data-to-parameter ratio is specific to English, and morphologically richer languages require fewer tokens to train efficiently.
Researchers outside the big labs should focus on language-dependent scaling effects, because controlled experiments (e.g., with French) are affordable and reveal significant differences.
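
To make the data-to-parameter ratio concrete, here is a minimal Python sketch of how a fixed compute budget splits between model size and training tokens. It rests on two assumptions that are not stated in the summary above: the common approximation that training cost is about 6 FLOPs per parameter per token, and the often-quoted Chinchilla rule of thumb of roughly 20 tokens per parameter. The function name and the `tokens_per_param` argument are hypothetical, chosen only for illustration.

```python
import math


def compute_optimal_allocation(compute_flops, tokens_per_param=20.0):
    """Split a compute budget into (parameters, tokens).

    Assumptions (not taken from the article):
      * training cost C is approximated as 6 * N * D FLOPs
      * the optimal token count is D = tokens_per_param * N
    Substituting gives C = 6 * r * N**2, so N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget roughly matching a 70B-parameter model trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_allocation(budget)
print(f"parameters: {n_params:.3e}, tokens: {n_tokens:.3e}")
```

If the optimal ratio really does vary by language, the only thing that changes in this sketch is the value passed as `tokens_per_param`: a lower ratio moves the same budget toward a larger model trained on fewer tokens.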
