Tree Search Distillation for Language Models Using PPO
- #Language Models
- #Tree Search
- #Reinforcement Learning
- The post explores distilling tree-search trajectories into a language model via PPO to improve reasoning.
- MCTS is run with Qwen-2.5-1.5B-Instruct to generate stronger reasoning trajectories, which are then distilled back into the model via online PPO.
- On Countdown, the distilled model achieves 11.3% mean@16 eval score, outperforming CISPO (8.4%) and best-of-N (7.7%).
- Countdown is chosen over GSM8K because its combinatorial search space benefits more from tree search.
- A dense reward function stabilizes training, while a sparse reward is used for evaluation.
- Parallel MCTS with virtual losses enhances search diversity and efficiency.
- Trajectories are selected by maximum visit count and submitted to a shared buffer for PPO training.
- Training uses a CISPO loss, with the total objective combining the policy, value, and KL-divergence terms.
- Infrastructure includes 8xH100 nodes, with separate generators and trainers synced via Redis.
- Best-of-N underperforms, possibly due to lack of incentive for robust single-shot reasoning.
- Future directions include tuning parallel workers and MCTS iterations for better performance.
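The dense training reward for Countdown could look roughly like the sketch below: full credit for hitting the target, shaped partial credit otherwise. The reward values, clipping of number reuse, and distance-based decay are all assumptions for illustration, not the post's exact function.

```python
import re

def dense_reward(expression: str, numbers: list[int], target: int) -> float:
    """Hypothetical shaped reward for Countdown (illustrative only)."""
    # Reject anything other than digits, whitespace, parentheses, and operators.
    if not re.fullmatch(r"[\d\s+\-*/()]+", expression):
        return 0.0
    try:
        value = eval(expression)  # input is restricted by the regex above
    except (SyntaxError, ZeroDivisionError):
        return 0.1  # small credit for producing a numeric-looking expression
    # Each provided number may be used at most once.
    pool = list(numbers)
    for tok in re.findall(r"\d+", expression):
        n = int(tok)
        if n in pool:
            pool.remove(n)
        else:
            return 0.1
    if value == target:
        return 1.0
    # Partial credit decays with relative distance from the target.
    return 0.1 + 0.8 * max(0.0, 1.0 - abs(value - target) / target)
```

A sparse evaluation reward would keep only the exact-match branch (1.0 on target, 0.0 otherwise), which matches the post's split between dense training rewards and sparse eval rewards.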
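The virtual-loss mechanism and the visit-count trajectory selection can be sketched as below. This is a generic PUCT-style selection rule under assumed constants (`c_puct`, the unit virtual-loss penalty), not the post's implementation: a pending virtual loss temporarily lowers a branch's value so concurrent workers fan out across the tree instead of piling onto one path.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float
    visits: int = 0
    value_sum: float = 0.0
    virtual_loss: int = 0
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        n = self.visits + self.virtual_loss
        # Pending virtual losses count as losses, pushing Q down for busy branches.
        return (self.value_sum - self.virtual_loss) / n if n else 0.0

def select_child(parent: Node, c_puct: float = 1.5):
    """PUCT selection with virtual losses: workers descending the tree
    concurrently are steered toward different branches."""
    total = sum(c.visits + c.virtual_loss for c in parent.children.values())
    def score(child: Node) -> float:
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits + child.virtual_loss)
        return child.q() + u
    action, child = max(parent.children.items(), key=lambda kv: score(kv[1]))
    child.virtual_loss += 1  # applied on the way down, reverted during backup
    return action, child

def select_trajectory(root: Node):
    # After search completes, commit to the most-visited child (robust-child rule).
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Selecting the same node twice in a row picks different children, since the first pick's virtual loss suppresses its score until the rollout backs up a real value.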
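The loss combination described above can be sketched per token as follows. CISPO-style losses weight the advantage-scaled log-probability by a clipped importance ratio (stop-gradient on the weight in a real autograd implementation); the clipping range and the `value_coef`/`kl_coef` coefficients here are assumptions, not the post's settings.

```python
import math

def cispo_token_loss(logp: float, logp_old: float, advantage: float,
                     eps_low: float = 0.0, eps_high: float = 0.2) -> float:
    """One-token CISPO-style loss (scalar sketch). The clipped importance
    weight would be wrapped in a stop-gradient in a real implementation."""
    ratio = math.exp(logp - logp_old)          # pi_new / pi_old for this token
    w = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)  # clipped IS weight
    return -(w * advantage * logp)

def total_loss(policy_loss: float, value_loss: float, kl: float,
               value_coef: float = 0.5, kl_coef: float = 0.01) -> float:
    # Total objective as described: policy (CISPO) + value + KL terms.
    return policy_loss + value_coef * value_loss + kl_coef * kl
```

Unlike PPO's clipped surrogate, CISPO clips only the importance weight rather than the whole objective, so gradients still flow through every token's log-probability.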