Hasty Briefs (beta)

Generalized on-policy distillation with reward extrapolation

3 months ago
  • #reinforcement learning
  • #knowledge distillation
  • #machine learning
  • On-policy distillation (OPD) improves student performance by aligning the student's token-level distribution with the teacher's on student-generated trajectories.
  • Theoretical analysis shows that OPD is a special case of dense, KL-constrained RL in which the reward and the KL regularizer receive equal weight.
  • Generalized On-Policy Distillation (G-OPD) extends OPD with a flexible choice of reference model and a reward scaling factor.
  • Reward extrapolation (ExOPD), which sets the reward scaling factor above 1, consistently outperforms standard OPD across teacher-student size pairings.
  • ExOPD enables students to exceed the teacher's performance ceiling when merging knowledge from multiple domain experts.
  • In strong-to-weak distillation, correcting the reward with the teacher's base model as the reference improves performance, but this requires access to a pre-RL variant of the teacher.
  • Comprehensive experiments on math reasoning and code generation tasks validate the effectiveness of G-OPD and ExOPD.
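The relationship sketched in the bullets above (OPD as equally weighted reward plus KL regularization, G-OPD adding a reference model and a scaling factor) can be illustrated with a small per-token computation. This is a hedged sketch, not the paper's implementation: the function name `g_opd_token_rewards`, the variable names, and the exact combination `alpha * (teacher - ref) - (student - ref)` are assumptions chosen so that `alpha = 1` collapses to the standard OPD signal.

```python
import numpy as np

def g_opd_token_rewards(logp_student, logp_teacher, logp_ref, alpha=1.0):
    """Illustrative per-token G-OPD training signal (assumed form, not the paper's code).

    reward  : alpha-scaled teacher log-likelihood ratio against a reference model
    kl_term : dense KL-style regularizer keeping the student near the reference
    """
    reward = alpha * (logp_teacher - logp_ref)
    kl_term = logp_student - logp_ref
    return reward - kl_term

rng = np.random.default_rng(0)
T = 8  # tokens in one student-generated trajectory
logp_s = rng.normal(-2.0, 0.5, T)  # student log-probs of its own sampled tokens
logp_t = rng.normal(-1.5, 0.5, T)  # teacher log-probs of the same tokens
logp_r = rng.normal(-2.5, 0.5, T)  # reference-model log-probs

# alpha = 1: the reference cancels and the signal reduces to
# logp_teacher - logp_student, i.e. the reverse-KL signal of standard OPD.
opd = g_opd_token_rewards(logp_s, logp_t, logp_r, alpha=1.0)

# alpha > 1: reward extrapolation (ExOPD) amplifies the teacher reward
# relative to the KL regularizer.
exopd = g_opd_token_rewards(logp_s, logp_t, logp_r, alpha=2.0)
```

Under this assumed parameterization, setting `alpha = 1` makes the reference model drop out entirely, which is one way to see the bullet's claim that OPD is the equal-weight special case; `alpha > 1` shifts the balance toward the teacher's preferences.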