Hasty Briefs

Scaling Laws, Honestly

16 hours ago
  • #Scaling Laws
  • #AI Research
  • #Language Models
  • The original Kaplan et al. scaling laws were incorrect because of a bug in the experimental setup: runs used a fixed amount of training data together with a cosine-decayed learning rate schedule.
  • The Chinchilla scaling laws corrected this by showing that models should be trained on more data relative to their size, which favors smaller, more efficient models: Chinchilla has 70B parameters, compared with 175B for GPT-3.
  • Scaling laws are language-contingent; Chinchilla's optimal data-to-parameter ratio is specific to English, with morphologically richer languages requiring fewer tokens for efficiency.
  • Non-big-lab researchers should focus on exploring language-dependent scaling effects, as controlled experiments (e.g., with French) are affordable and reveal significant differences.
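
The data-to-parameter trade-off in the points above can be sketched numerically. This is a minimal illustration, not the article's method, and it rests on two assumptions that do not appear in the summary: the common approximation that training compute is about C ≈ 6·N·D FLOPs (N parameters, D tokens), and the often-quoted Chinchilla rule of thumb of roughly 20 tokens per parameter. The function name `compute_optimal_split` and the alternative ratio of 10 are hypothetical, chosen only to show how a language-dependent ratio would shift the split.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C ~= 6 * N * D and a fixed ratio D = tokens_per_param * N.
    Both are rough approximations, not exact laws.
    """
    if flops <= 0 or tokens_per_param <= 0:
        raise ValueError("flops and tokens_per_param must be positive")
    # Substitute D = r * N into C = 6 * N * D, then solve for N.
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Budget roughly matching a 70B-parameter model trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12

# English-like ratio (assumed rule of thumb: about 20 tokens per parameter).
n_en, d_en = compute_optimal_split(budget, tokens_per_param=20.0)

# Hypothetical lower ratio, standing in for a language that needs fewer tokens.
n_alt, d_alt = compute_optimal_split(budget, tokens_per_param=10.0)

print(f"ratio 20: {n_en:.3e} params, {d_en:.3e} tokens")
print(f"ratio 10: {n_alt:.3e} params, {d_alt:.3e} tokens")
```

Under these assumptions, lowering the ratio at a fixed budget moves the optimum toward a larger model trained on fewer tokens, which is the kind of shift a controlled per-language experiment would need to measure rather than assume.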