On-Policy Distillation
6 months ago
- #reinforcement-learning
- #machine-learning
- #distillation
- LLMs achieve expert performance through a combination of pre-training, mid-training, and post-training stages.
- Smaller models with specialized training can outperform larger generalist models in specific domains.
- On-policy training samples rollouts from the student model itself and assigns them rewards, while off-policy training imitates target outputs produced by an external source such as a teacher model.
- On-policy distillation combines the on-policy relevance of RL with the dense reward signal of distillation, grading every token of the student's sampled trajectory against the teacher (see the sketch after this list).
- On-policy distillation is shown to be more compute-efficient than RL, achieving similar performance with fewer steps.
- Distillation can effectively reuse training data, allowing multiple training epochs on the same prompt without overfitting.
- On-policy distillation is useful for continual learning, enabling models to acquire new knowledge without degrading prior capabilities.
- The method is applied to tasks like mathematical reasoning and personalized assistant training, demonstrating its versatility.
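
A minimal sketch of what "grading each token" could look like, assuming toy PyTorch causal LMs called `student` and `teacher`; the names, shapes, and hyperparameters are illustrative assumptions, not the original implementation. The student samples its own trajectory (the on-policy part), and every generated position is scored with the reverse KL between the student's and teacher's next-token distributions (the dense, per-token signal).

```python
# Sketch of one on-policy distillation step with toy models; assumed setup, not the authors' code.
import torch
import torch.nn.functional as F

VOCAB, HIDDEN, PROMPT_LEN, GEN_LEN = 100, 32, 4, 8

class ToyLM(torch.nn.Module):
    """Stand-in causal LM: embeds token ids and predicts next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, HIDDEN)
        self.head = torch.nn.Linear(HIDDEN, VOCAB)

    def forward(self, ids):                 # ids: (batch, seq)
        return self.head(self.embed(ids))   # logits: (batch, seq, vocab)

student, teacher = ToyLM(), ToyLM()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# 1) On-policy: sample a trajectory from the *student*, token by token.
prompt = torch.randint(VOCAB, (1, PROMPT_LEN))
ids = prompt
with torch.no_grad():
    for _ in range(GEN_LEN):
        next_logits = student(ids)[:, -1, :]
        next_id = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
        ids = torch.cat([ids, next_id], dim=1)

# 2) Dense grading: reverse KL between student and teacher at every
#    position that predicts a generated token.
gen_pos = slice(PROMPT_LEN - 1, ids.shape[1] - 1)
student_logp = F.log_softmax(student(ids)[:, gen_pos, :], dim=-1)
with torch.no_grad():
    teacher_logp = F.log_softmax(teacher(ids)[:, gen_pos, :], dim=-1)

# Reverse KL(student || teacher), summed over the vocabulary per position.
per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
loss = per_token_kl.mean()

opt.zero_grad()
loss.backward()
opt.step()
print("per-token reverse KL:", per_token_kl.squeeze(0).tolist())
```

Computing the full reverse KL over the vocabulary is feasible here because the toy vocabulary is tiny; with a real tokenizer one might instead approximate it from the sampled token's log-probabilities, but either way the signal is per token rather than per episode, which is what distinguishes this from a sparse RL reward.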