On-Policy Distillation
6 months ago
- #reinforcement-learning
- #machine-learning
- #distillation
- LLMs achieve expert performance through a combination of pre-training, mid-training, and post-training stages.
- Smaller models with specialized training can outperform larger generalist models in specific domains.
- On-policy training samples rollouts from the student model itself and assigns them rewards, while off-policy training imitates target outputs produced by an external source such as a teacher model.
- On-policy distillation combines the on-policy relevance of RL with the dense reward signal of distillation, grading every token of the student's sampled trajectory against the teacher (see the sketch after this list).
- On-policy distillation is shown to be more compute-efficient than RL, achieving similar performance with fewer steps.
- Distillation can effectively reuse training data, allowing multiple training epochs on the same prompt without overfitting.
- On-policy distillation is useful for continual learning, enabling models to acquire new knowledge without degrading prior capabilities.
- The method is applied to tasks like mathematical reasoning and personalized assistant training, demonstrating its versatility.
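
A minimal sketch of what "grading each token" could look like, assuming toy PyTorch causal LMs called `student` and `teacher`; the names, shapes, and hyperparameters are illustrative assumptions, not the original implementation. The student samples its own trajectory (the on-policy part), and every generated position is scored with the reverse KL between the student's and teacher's next-token distributions (the dense, per-token signal).

```python
# Sketch of one on-policy distillation step with toy models; assumed setup, not the authors' code.
import torch
import torch.nn.functional as F

VOCAB, HIDDEN, PROMPT_LEN, GEN_LEN = 100, 32, 4, 8

class ToyLM(torch.nn.Module):
    """Stand-in causal LM: embeds token ids and predicts next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, HIDDEN)
        self.head = torch.nn.Linear(HIDDEN, VOCAB)

    def forward(self, ids):                 # ids: (batch, seq)
        return self.head(self.embed(ids))   # logits: (batch, seq, vocab)

student, teacher = ToyLM(), ToyLM()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# 1) On-policy: sample a trajectory from the *student*, token by token.
prompt = torch.randint(VOCAB, (1, PROMPT_LEN))
ids = prompt
with torch.no_grad():
    for _ in range(GEN_LEN):
        next_logits = student(ids)[:, -1, :]
        next_id = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
        ids = torch.cat([ids, next_id], dim=1)

# 2) Dense grading: reverse KL between student and teacher at every
#    position that predicts a generated token.
gen_pos = slice(PROMPT_LEN - 1, ids.shape[1] - 1)
student_logp = F.log_softmax(student(ids)[:, gen_pos, :], dim=-1)
with torch.no_grad():
    teacher_logp = F.log_softmax(teacher(ids)[:, gen_pos, :], dim=-1)

# Reverse KL(student || teacher), summed over the vocabulary per position.
per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
loss = per_token_kl.mean()

opt.zero_grad()
loss.backward()
opt.step()
print("per-token reverse KL:", per_token_kl.squeeze(0).tolist())
```

Computing the full reverse KL over the vocabulary is feasible here because the toy vocabulary is tiny; with a real tokenizer one might instead approximate it from the sampled token's log-probabilities, but either way the signal is per token rather than per episode, which is what distinguishes this from a sparse RL reward.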