Distillation Scaling Laws
- #model distillation
- #scaling laws
- #machine learning
- Proposes a distillation scaling law that estimates distilled student performance from the compute budget and its allocation between student and teacher.
- Mitigates the risks of large-scale distillation by enabling compute allocations that maximize student performance (see the sketch after this list).
- Provides compute-optimal distillation recipes both when a teacher already exists and when one must be trained.
- Finds that distillation outperforms supervised learning when many students are to be distilled or a teacher already exists, but only up to a compute level the law predicts.
- Indicates that supervised learning is preferable when only one student is to be distilled and the teacher would also need to be trained.
- Deepens understanding of distillation through a large-scale study, aiding experimental design.
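
As a rough illustration of what compute-optimal allocation means here, the sketch below grid-searches the split of a fixed FLOP budget between training a teacher and distilling a student. The functional forms, exponents, and constants in `teacher_loss` and `student_loss` are made-up placeholders for illustration, not the paper's fitted law; only the C ≈ 6·N·D training-cost approximation is a standard convention.

```python
"""Illustrative sketch: compute-optimal teacher/student allocation under
assumed (placeholder) scaling laws. Not the paper's fitted coefficients."""

import numpy as np

FLOPS_PER_PARAM_TOKEN = 6.0  # standard C ~ 6*N*D training-cost approximation


def teacher_loss(n_params: float, n_tokens: float) -> float:
    """Hypothetical Chinchilla-style supervised law: E + A/N^alpha + B/D^beta."""
    E, A, B, alpha, beta = 1.7, 400.0, 1800.0, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta


def student_loss(teacher_l: float, n_student: float, d_distill: float) -> float:
    """Hypothetical distillation law: student loss improves with its own size
    and distillation tokens, but is floored near the teacher's loss."""
    A, B, alpha, beta = 350.0, 1500.0, 0.33, 0.27
    return teacher_l + A / n_student**alpha + B / d_distill**beta


def tokens_for_budget(flops: float, n_params: float) -> float:
    """Tokens trainable on a model of n_params within a FLOP budget."""
    return flops / (FLOPS_PER_PARAM_TOKEN * n_params)


def best_allocation(total_flops: float, n_teacher: float, n_student: float):
    """Grid-search the fraction of compute spent on the teacher vs. distillation."""
    best = None
    for frac in np.linspace(0.05, 0.95, 19):
        d_teacher = tokens_for_budget(frac * total_flops, n_teacher)
        d_student = tokens_for_budget((1.0 - frac) * total_flops, n_student)
        l_teacher = teacher_loss(n_teacher, d_teacher)
        l_student = student_loss(l_teacher, n_student, d_student)
        if best is None or l_student < best[1]:
            best = (frac, l_student)
    return best


if __name__ == "__main__":
    frac, loss = best_allocation(total_flops=1e21, n_teacher=3e9, n_student=1e9)
    print(f"teacher share of compute: {frac:.2f}, predicted student loss: {loss:.3f}")
```

The tradeoff the search exposes is the one the bullets describe: spending more compute on the teacher lowers its loss (and thus the student's floor), but leaves fewer distillation tokens for the student, so the optimum sits at an interior split that a fitted law can predict in advance.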