Distillation Scaling Laws

9 days ago
  • #model distillation
  • #scaling laws
  • #machine learning
  • Proposes a distillation scaling law that estimates student performance from the total compute budget and how it is allocated between student and teacher.
  • Mitigates the risk of large-scale distillation runs by enabling compute-optimal allocation of that budget to maximize student performance.
  • Provides compute-optimal distillation recipes for two scenarios: a teacher already exists, or a teacher must also be trained (a toy sketch of this allocation problem follows the list).
  • Finds that distillation outperforms supervised learning when many students are distilled or a teacher already exists, but only up to a compute level the law predicts.
  • Indicates supervised learning is preferable when only one student is needed and a teacher would also have to be trained.
  • Deepens understanding of distillation through a large-scale empirical study, informing experimental design.
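
As a rough illustration of the allocation question the recipes address, the sketch below uses a Chinchilla-style parametric loss for the teacher and a made-up student-loss form whose gap to the teacher shrinks as student size and distillation tokens grow. The constants, the functional form, the model sizes, and the 6·N·D compute rule are all illustrative assumptions, not the paper's fitted law; the point is only to show how one would sweep a fixed budget between teacher pretraining and student distillation and pick the split with the lowest predicted student loss.

```python
import numpy as np

# Illustrative Chinchilla-style constants; the paper fits its own coefficients,
# which are not reproduced here.
A, B, E = 406.4, 410.7, 1.69
alpha, beta = 0.34, 0.28

def pretraining_loss(n_params, n_tokens):
    """Chinchilla-style loss estimate for a model trained from scratch."""
    return E + A / n_params**alpha + B / n_tokens**beta

def distilled_student_loss(n_student, d_tokens, teacher_loss, c0=1.1, c1=0.3):
    """Hypothetical stand-in for a distillation scaling law: the student's loss
    sits above the teacher's by a gap that shrinks with student size and
    distillation tokens. This is NOT the paper's fitted functional form."""
    gap = A / n_student**alpha + B / d_tokens**beta
    return teacher_loss + c1 * gap**c0

n_teacher, n_student = 10e9, 1e9   # parameter counts (hypothetical)
total_flops = 1e21                 # total budget: teacher training + distillation

best = None
for frac in np.linspace(0.05, 0.95, 19):
    # frac of the budget trains the teacher; the rest distils the student.
    d_teacher = frac * total_flops / (6 * n_teacher)        # 6*N*D FLOP rule of thumb
    d_student = (1 - frac) * total_flops / (6 * n_student)
    l_teacher = pretraining_loss(n_teacher, d_teacher)
    l_student = distilled_student_loss(n_student, d_student, l_teacher)
    if best is None or l_student < best[1]:
        best = (frac, l_student)

print(f"best teacher-compute fraction: {best[0]:.2f}, "
      f"predicted student loss: {best[1]:.3f}")
```

If the teacher's training cost is instead amortized over many students, the per-student teacher compute shrinks, which is consistent with the summary's point that distillation wins in multi-student or existing-teacher settings but not when a single student must also pay for teacher training.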