Hasty Briefs beta


On-Policy Distillation

6 months ago
  • #reinforcement-learning
  • #machine-learning
  • #distillation
  • LLMs achieve expert performance through a combination of pre-training, mid-training, and post-training stages.
  • Smaller models with specialized training can outperform larger generalist models in specific domains.
  • On-policy training samples trajectories from the student model and assigns them rewards, while off-policy training imitates target outputs drawn from an external source such as a teacher model.
  • On-policy distillation combines the on-policy relevance of RL with the dense reward signal of distillation: the teacher grades every token of the student's sampled trajectory.
  • On-policy distillation is shown to be more compute-efficient than RL, reaching comparable performance in far fewer training steps.
  • Distillation can effectively reuse training data, allowing multiple training epochs on the same prompt without overfitting.
  • On-policy distillation is useful for continual learning, enabling models to acquire new knowledge without degrading prior capabilities.
  • The method is applied to tasks like mathematical reasoning and personalized assistant training, demonstrating its versatility.
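The core loop summarized above can be sketched in a few lines. The toy below uses fixed random logit tables as hypothetical stand-ins for the student and teacher (in a real LM, both would be evaluated on the student-sampled prefix at each step): sample a trajectory from the student, then grade each position with a per-token divergence against the teacher, giving the dense reward signal the bullets describe. This is a minimal illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5  # toy vocabulary size (assumption)

def softmax(logits):
    z = logits - logits.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

# Hypothetical stand-ins for model outputs over a 4-token trajectory:
# real student/teacher models would produce these logits per context.
student_logits = rng.normal(size=(4, VOCAB))
teacher_logits = student_logits + rng.normal(scale=0.5, size=(4, VOCAB))

# 1. Sample the trajectory from the STUDENT (on-policy): the states being
#    graded are the ones the student actually visits.
student_probs = softmax(student_logits)
tokens = np.array([rng.choice(VOCAB, p=p) for p in student_probs])

# 2. Dense, per-token grading against the teacher: here a per-position
#    KL(student || teacher) over the vocabulary distribution.
teacher_probs = softmax(teacher_logits)
per_token_kl = np.sum(
    student_probs * (np.log(student_probs) - np.log(teacher_probs)), axis=-1
)

# 3. Minimizing the mean divergence pulls the student toward the teacher
#    on its own trajectories, rather than on teacher-written text.
loss = per_token_kl.mean()
print(tokens, loss)
```

Contrast this with off-policy distillation, where the tokens would instead be sampled from the teacher and the student trained to imitate them; the on-policy variant concentrates the learning signal on states the student reaches itself.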