Generalized on-policy distillation with reward extrapolation
3 months ago
- #reinforcement learning
- #knowledge distillation
- #machine learning
- On-policy distillation (OPD) improves student performance by aligning the student with the teacher's logit distribution on student-generated trajectories.
- Theoretical analysis shows that OPD is a special case of dense, KL-constrained RL in which the reward and the KL regularizer receive equal weight (a worked form of this objective appears after this list).
- Generalized On-Policy Distillation (G-OPD) extends OPD by introducing flexible reference models and a reward scaling factor.
- Reward extrapolation (ExOPD), which sets the reward scaling factor above 1, consistently outperforms standard OPD across various teacher-student size pairings (see the code sketch after this list).
- ExOPD enables the student to surpass the teacher's performance ceiling when merging knowledge from domain-expert teachers.
- In strong-to-weak distillation, a reward correction that uses the teacher's base model as the reference improves performance, but it requires access to pre-RL variants of the teacher.
- Comprehensive experiments on math reasoning and code generation tasks validate the effectiveness of G-OPD and ExOPD.
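
To make the RL view concrete, here is a minimal worked form of the objective, assuming the standard reverse-KL formulation of OPD; the notation ($\pi_\theta$ student, $\pi_T$ teacher, $\pi_{\mathrm{ref}}$ reference model, $\alpha$ reward scaling factor) is illustrative and may differ from the paper.

```latex
% Standard OPD: per-token reverse KL to the teacher on student-generated trajectories.
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta}\Big[\textstyle\sum_t
      D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid y_{<t}) \,\|\, \pi_T(\cdot \mid y_{<t})\big)\Big]

% The same objective, rewritten as dense KL-constrained RL with a reference model
% \pi_{\mathrm{ref}}: per-token reward \log(\pi_T/\pi_{\mathrm{ref}}) plus an equally
% weighted KL regularizer toward \pi_{\mathrm{ref}}.
\max_\theta \;
  \mathbb{E}_{y \sim \pi_\theta}\Big[\textstyle\sum_t
      \log\tfrac{\pi_T(y_t \mid y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid y_{<t})}
    - D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid y_{<t}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid y_{<t})\big)\Big]

% G-OPD: scale the reward by \alpha (ExOPD uses \alpha > 1). For \alpha = 1 the
% \pi_{\mathrm{ref}} terms cancel and standard OPD is recovered.
\max_\theta \;
  \mathbb{E}_{y \sim \pi_\theta}\Big[\textstyle\sum_t
      \alpha \log\tfrac{\pi_T(y_t \mid y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid y_{<t})}
    - D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid y_{<t}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid y_{<t})\big)\Big]
```

Treating the sampled contexts as fixed (no gradient through the sampling step), the G-OPD objective above is, up to a constant, equivalent to minimizing a per-token reverse KL toward an extrapolated target distribution proportional to $\pi_T^{\alpha}\,\pi_{\mathrm{ref}}^{\,1-\alpha}$; with $\alpha > 1$ this target moves beyond the teacher, away from the reference.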
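A minimal PyTorch-style sketch of that extrapolated-target view, assuming precomputed logits from the student, teacher, and reference model on the same student-generated trajectory; the function and argument names (`g_opd_loss`, `alpha`, `mask`) are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def g_opd_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               ref_logits: torch.Tensor,
               mask: torch.Tensor,
               alpha: float = 1.5) -> torch.Tensor:
    """Per-token G-OPD loss on a student-generated trajectory.

    All logits are [batch, seq_len, vocab]; mask is [batch, seq_len] with 1s on
    response tokens. alpha = 1 recovers standard OPD; alpha > 1 is ExOPD.
    """
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # Extrapolated target: softmax of alpha * log pi_T + (1 - alpha) * log pi_ref.
    # With alpha > 1 the target is pushed beyond the teacher, away from the reference.
    target_logp = F.log_softmax(alpha * teacher_logp + (1.0 - alpha) * ref_logp, dim=-1)

    # Reverse KL from the student to the extrapolated target at every position of
    # the student-sampled sequence (no gradient flows through the sampling step).
    student_logp = F.log_softmax(student_logits, dim=-1)
    kl = (student_logp.exp() * (student_logp - target_logp.detach())).sum(dim=-1)

    return (kl * mask).sum() / mask.sum().clamp(min=1)
```

Usage, under the same assumptions: sample a rollout with the current student, run the student (with gradients), the teacher, and the reference model over the same token sequence to obtain the three logit tensors, then backpropagate this loss. The summary above mentions the teacher's base model as the reference for the strong-to-weak correction; how the reference is chosen in other settings is not specified here.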