Generalized on-policy distillation with reward extrapolation
3 months ago
- #reinforcement learning
- #knowledge distillation
- #machine learning
- On-policy distillation (OPD) improves student performance by aligning the student with the teacher's logit distribution on student-generated trajectories.
- Theoretical analysis shows that OPD is a special case of dense, KL-constrained RL in which the reward and the KL regularizer receive equal weight (a worked form of this objective appears after this list).
- Generalized On-Policy Distillation (G-OPD) extends OPD by introducing flexible reference models and a reward scaling factor.
- Reward extrapolation (ExOPD), which sets the reward scaling factor above 1, consistently outperforms standard OPD across various teacher-student size pairings (see the code sketch after this list).
- ExOPD enables the student to surpass the teacher's performance ceiling when merging knowledge from domain-expert teachers.
- In strong-to-weak distillation, a reward correction that uses the teacher's base model as the reference improves performance, but it requires access to pre-RL variants of the teacher.
- Comprehensive experiments on math reasoning and code generation tasks validate the effectiveness of G-OPD and ExOPD.
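
To make the RL view concrete, here is a minimal worked form of the objective, assuming the standard reverse-KL formulation of OPD; the notation ($\pi_\theta$ student, $\pi_T$ teacher, $\pi_{\mathrm{ref}}$ reference model, $\alpha$ reward scaling factor) is illustrative and may differ from the paper.

```latex
% Standard OPD: per-token reverse KL to the teacher on student-generated trajectories.
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta}\Big[\textstyle\sum_t
      D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid y_{<t}) \,\|\, \pi_T(\cdot \mid y_{<t})\big)\Big]

% The same objective, rewritten as dense KL-constrained RL with a reference model
% \pi_{\mathrm{ref}}: per-token reward \log(\pi_T/\pi_{\mathrm{ref}}) plus an equally
% weighted KL regularizer toward \pi_{\mathrm{ref}}.
\max_\theta \;
  \mathbb{E}_{y \sim \pi_\theta}\Big[\textstyle\sum_t
      \log\tfrac{\pi_T(y_t \mid y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid y_{<t})}
    - D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid y_{<t}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid y_{<t})\big)\Big]

% G-OPD: scale the reward by \alpha (ExOPD uses \alpha > 1). For \alpha = 1 the
% \pi_{\mathrm{ref}} terms cancel and standard OPD is recovered.
\max_\theta \;
  \mathbb{E}_{y \sim \pi_\theta}\Big[\textstyle\sum_t
      \alpha \log\tfrac{\pi_T(y_t \mid y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid y_{<t})}
    - D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid y_{<t}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid y_{<t})\big)\Big]
```

Treating the sampled contexts as fixed (no gradient through the sampling step), the G-OPD objective above is, up to a constant, equivalent to minimizing a per-token reverse KL toward an extrapolated target distribution proportional to $\pi_T^{\alpha}\,\pi_{\mathrm{ref}}^{\,1-\alpha}$; with $\alpha > 1$ this target moves beyond the teacher, away from the reference.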
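A minimal PyTorch-style sketch of that extrapolated-target view, assuming precomputed logits from the student, teacher, and reference model on the same student-generated trajectory; the function and argument names (`g_opd_loss`, `alpha`, `mask`) are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def g_opd_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               ref_logits: torch.Tensor,
               mask: torch.Tensor,
               alpha: float = 1.5) -> torch.Tensor:
    """Per-token G-OPD loss on a student-generated trajectory.

    All logits are [batch, seq_len, vocab]; mask is [batch, seq_len] with 1s on
    response tokens. alpha = 1 recovers standard OPD; alpha > 1 is ExOPD.
    """
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # Extrapolated target: softmax of alpha * log pi_T + (1 - alpha) * log pi_ref.
    # With alpha > 1 the target is pushed beyond the teacher, away from the reference.
    target_logp = F.log_softmax(alpha * teacher_logp + (1.0 - alpha) * ref_logp, dim=-1)

    # Reverse KL from the student to the extrapolated target at every position of
    # the student-sampled sequence (no gradient flows through the sampling step).
    student_logp = F.log_softmax(student_logits, dim=-1)
    kl = (student_logp.exp() * (student_logp - target_logp.detach())).sum(dim=-1)

    return (kl * mask).sum() / mask.sum().clamp(min=1)
```

Usage, under the same assumptions: sample a rollout with the current student, run the student (with gradients), the teacher, and the reference model over the same token sequence to obtain the three logit tensors, then backpropagate this loss. The summary above mentions the teacher's base model as the reference for the strong-to-weak correction; how the reference is chosen in other settings is not specified here.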