Hasty Briefs

Tree Search Distillation for Language Models Using PPO

a day ago
  • #Language Models
  • #Tree Search
  • #Reinforcement Learning
  • The post explores distilling Monte Carlo tree search trajectories into a language model with PPO to improve reasoning.
  • MCTS is run on top of Qwen-2.5-1.5B-Instruct to generate stronger trajectories, which are then distilled back into the model via online PPO.
  • On Countdown, the distilled model achieves an 11.3% mean@16 eval score, outperforming CISPO (8.4%) and best-of-N (7.7%).
  • Countdown is chosen over GSM8K because its combinatorial structure benefits more from tree search.
  • A dense reward function stabilizes training, while sparse rewards are used for evaluation.
  • Parallel MCTS with virtual losses enhances search diversity and efficiency.
  • Trajectories are selected by maximum visit count and submitted to a shared buffer for PPO training.
  • Training uses the CISPO loss; the total loss combines the PPO policy, value, and KL-divergence objectives.
  • Infrastructure includes 8xH100 nodes, with separate generators and trainers synced via Redis.
  • Best-of-N underperforms, possibly because it provides no incentive for robust single-shot reasoning.
  • Future directions include tuning parallel workers and MCTS iterations for better performance.
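The dense-versus-sparse reward split above can be sketched in a few lines. The post does not give its exact reward function, so the `countdown_reward` name, the shaping term, and the 0.5 closeness scale below are all assumptions:

```python
def countdown_reward(expr: str, target: int, dense: bool = True) -> float:
    """Hedged sketch of a Countdown reward: exact match scores 1.0;
    with dense shaping, near-misses earn partial credit (assumed form)."""
    try:
        # Countdown expressions use only numbers, + - * /, and parentheses.
        value = eval(expr, {"__builtins__": {}})
    except Exception:
        return 0.0  # malformed expression
    if value == target:
        return 1.0
    if not dense:
        return 0.0  # sparse reward used at evaluation: exact match only
    # Dense shaping for training: partial credit that grows as the
    # expression's value approaches the target (always below 0.5).
    return 0.5 / (1.0 + abs(value - target))
```

Shaping like this gives PPO a gradient signal on near-misses, which is plausibly what stabilizes training relative to an all-or-nothing reward.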
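The virtual-loss mechanism mentioned above can be illustrated with a minimal selection/backpropagation sketch. The `Node` structure, UCT constant, and penalty size are assumptions, not the post's implementation: the point is only that an in-flight penalty steers concurrent workers toward different branches.

```python
import math

VIRTUAL_LOSS = 1.0  # assumed penalty per in-flight simulation
C_UCT = 1.4         # assumed exploration constant

class Node:
    def __init__(self):
        self.visits = 0
        self.value_sum = 0.0
        self.virtual_loss = 0  # in-flight simulations through this node
        self.children = {}

    def uct(self, parent_visits):
        # Virtual loss inflates the visit count and deflates the value,
        # so a second worker arriving here prefers a sibling branch.
        n = self.visits + self.virtual_loss
        if n == 0:
            return float("inf")
        q = (self.value_sum - VIRTUAL_LOSS * self.virtual_loss) / n
        return q + C_UCT * math.sqrt(math.log(parent_visits + 1) / n)

def select_child(node):
    parent_visits = node.visits + node.virtual_loss
    action, child = max(node.children.items(),
                        key=lambda kv: kv[1].uct(parent_visits))
    child.virtual_loss += 1  # applied before the rollout starts
    return action, child

def backprop(path, reward):
    for node in path:
        node.virtual_loss = max(0, node.virtual_loss - 1)  # release the lease
        node.visits += 1
        node.value_sum += reward
```

With this penalty in place, two workers selecting from the same root before either rollout finishes will expand different children, which is the diversity benefit the post attributes to virtual losses.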
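The max-visit-count selection rule can be sketched as a greedy walk down the finished search tree; the `SearchNode` shape here is an assumption carried over from a generic MCTS implementation:

```python
class SearchNode:
    # Minimal node sketch: visit count plus children keyed by action (token).
    def __init__(self, visits=0, children=None):
        self.visits = visits
        self.children = children or {}

def extract_trajectory(root):
    """Walk the tree greedily by visit count (the post's stated rule)
    and return the action sequence to submit to the shared PPO buffer."""
    trajectory = []
    node = root
    while node.children:
        action, node = max(node.children.items(),
                           key=lambda kv: kv[1].visits)
        trajectory.append(action)
    return trajectory
```

Visit counts are the standard MCTS proxy for confidence, so this picks the trajectory the search spent the most compute validating rather than the single highest-value rollout.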
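The combined objective can be sketched in plain Python. The ratio-clipping form below follows the published CISPO idea of clipping the importance-sampling weight itself (rather than the PPO surrogate), and the coefficients and function names are assumptions, not the post's values:

```python
import math

def cispo_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    """Sketch of a CISPO-style policy loss: clip the importance ratio
    (held constant w.r.t. gradients in a real autograd setting) and
    weight the new log-prob by clipped_ratio * advantage."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += -clipped * adv * lp_new  # lp_new is the differentiable term
    return total / len(advantages)

def total_loss(policy_loss, value_loss, kl_div,
               value_coef=0.5, kl_coef=0.1):
    # Assumed coefficients; the post only states that the total loss
    # combines the PPO policy, value, and KL-divergence objectives.
    return policy_loss + value_coef * value_loss + kl_coef * kl_div
```

Unlike vanilla PPO clipping, clipping the ratio this way keeps a (bounded) gradient flowing through every token, which is the usual motivation cited for CISPO.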