Distillation Scaling Laws
- #model distillation
- #scaling laws
- #machine learning
- Proposes a distillation scaling law that estimates distilled student performance from the compute budget and its allocation between student and teacher.
- Mitigates the risks of large-scale distillation by enabling compute allocations that maximize student performance (see the sketch after this list).
- Provides compute-optimal distillation recipes both when a teacher already exists and when one must be trained.
- Finds that distillation outperforms supervised learning when many students are to be distilled or a teacher already exists, but only up to a compute level the law predicts.
- Indicates that supervised learning is preferable when only one student is to be distilled and the teacher would also need to be trained.
- Deepens understanding of distillation through a large-scale study, aiding experimental design.
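
As a rough illustration of what compute-optimal allocation means here, the sketch below grid-searches the split of a fixed FLOP budget between training a teacher and distilling a student. The functional forms, exponents, and constants in `teacher_loss` and `student_loss` are made-up placeholders for illustration, not the paper's fitted law; only the C ≈ 6·N·D training-cost approximation is a standard convention.

```python
"""Illustrative sketch: compute-optimal teacher/student allocation under
assumed (placeholder) scaling laws. Not the paper's fitted coefficients."""

import numpy as np

FLOPS_PER_PARAM_TOKEN = 6.0  # standard C ~ 6*N*D training-cost approximation


def teacher_loss(n_params: float, n_tokens: float) -> float:
    """Hypothetical Chinchilla-style supervised law: E + A/N^alpha + B/D^beta."""
    E, A, B, alpha, beta = 1.7, 400.0, 1800.0, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta


def student_loss(teacher_l: float, n_student: float, d_distill: float) -> float:
    """Hypothetical distillation law: student loss improves with its own size
    and distillation tokens, but is floored near the teacher's loss."""
    A, B, alpha, beta = 350.0, 1500.0, 0.33, 0.27
    return teacher_l + A / n_student**alpha + B / d_distill**beta


def tokens_for_budget(flops: float, n_params: float) -> float:
    """Tokens trainable on a model of n_params within a FLOP budget."""
    return flops / (FLOPS_PER_PARAM_TOKEN * n_params)


def best_allocation(total_flops: float, n_teacher: float, n_student: float):
    """Grid-search the fraction of compute spent on the teacher vs. distillation."""
    best = None
    for frac in np.linspace(0.05, 0.95, 19):
        d_teacher = tokens_for_budget(frac * total_flops, n_teacher)
        d_student = tokens_for_budget((1.0 - frac) * total_flops, n_student)
        l_teacher = teacher_loss(n_teacher, d_teacher)
        l_student = student_loss(l_teacher, n_student, d_student)
        if best is None or l_student < best[1]:
            best = (frac, l_student)
    return best


if __name__ == "__main__":
    frac, loss = best_allocation(total_flops=1e21, n_teacher=3e9, n_student=1e9)
    print(f"teacher share of compute: {frac:.2f}, predicted student loss: {loss:.3f}")
```

The tradeoff the search exposes is the one the bullets describe: spending more compute on the teacher lowers its loss (and thus the student's floor), but leaves fewer distillation tokens for the student, so the optimum sits at an interior split that a fitted law can predict in advance.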