Tree Search Distillation for Language Models Using PPO
- #Language Models
- #Tree Search
- #Reinforcement Learning
- The post explores distilling tree-search trajectories into a language model via PPO to improve reasoning.
- MCTS is run with Qwen-2.5-1.5B-Instruct to generate stronger reasoning trajectories, which are then distilled back into the model via online PPO.
- On Countdown, the distilled model achieves 11.3% mean@16 eval score, outperforming CISPO (8.4%) and best-of-N (7.7%).
- Countdown is chosen over GSM8K because its combinatorial search space benefits more from tree search.
- A dense reward function stabilizes training, while a sparse reward is used for evaluation.
- Parallel MCTS with virtual losses enhances search diversity and efficiency.
- Trajectories are selected by maximum visit count and submitted to a shared buffer for PPO training.
- Training uses a CISPO loss, with the total objective combining the policy, value, and KL-divergence terms.
- Infrastructure includes 8xH100 nodes, with separate generators and trainers synced via Redis.
- Best-of-N underperforms, possibly due to lack of incentive for robust single-shot reasoning.
- Future directions include tuning parallel workers and MCTS iterations for better performance.
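The dense training reward for Countdown could look roughly like the sketch below: full credit for hitting the target, shaped partial credit otherwise. The reward values, clipping of number reuse, and distance-based decay are all assumptions for illustration, not the post's exact function.

```python
import re

def dense_reward(expression: str, numbers: list[int], target: int) -> float:
    """Hypothetical shaped reward for Countdown (illustrative only)."""
    # Reject anything other than digits, whitespace, parentheses, and operators.
    if not re.fullmatch(r"[\d\s+\-*/()]+", expression):
        return 0.0
    try:
        value = eval(expression)  # input is restricted by the regex above
    except (SyntaxError, ZeroDivisionError):
        return 0.1  # small credit for producing a numeric-looking expression
    # Each provided number may be used at most once.
    pool = list(numbers)
    for tok in re.findall(r"\d+", expression):
        n = int(tok)
        if n in pool:
            pool.remove(n)
        else:
            return 0.1
    if value == target:
        return 1.0
    # Partial credit decays with relative distance from the target.
    return 0.1 + 0.8 * max(0.0, 1.0 - abs(value - target) / target)
```

A sparse evaluation reward would keep only the exact-match branch (1.0 on target, 0.0 otherwise), which matches the post's split between dense training rewards and sparse eval rewards.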
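The virtual-loss mechanism and the visit-count trajectory selection can be sketched as below. This is a generic PUCT-style selection rule under assumed constants (`c_puct`, the unit virtual-loss penalty), not the post's implementation: a pending virtual loss temporarily lowers a branch's value so concurrent workers fan out across the tree instead of piling onto one path.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float
    visits: int = 0
    value_sum: float = 0.0
    virtual_loss: int = 0
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        n = self.visits + self.virtual_loss
        # Pending virtual losses count as losses, pushing Q down for busy branches.
        return (self.value_sum - self.virtual_loss) / n if n else 0.0

def select_child(parent: Node, c_puct: float = 1.5):
    """PUCT selection with virtual losses: workers descending the tree
    concurrently are steered toward different branches."""
    total = sum(c.visits + c.virtual_loss for c in parent.children.values())
    def score(child: Node) -> float:
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits + child.virtual_loss)
        return child.q() + u
    action, child = max(parent.children.items(), key=lambda kv: score(kv[1]))
    child.virtual_loss += 1  # applied on the way down, reverted during backup
    return action, child

def select_trajectory(root: Node):
    # After search completes, commit to the most-visited child (robust-child rule).
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Selecting the same node twice in a row picks different children, since the first pick's virtual loss suppresses its score until the rollout backs up a real value.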
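The loss combination described above can be sketched per token as follows. CISPO-style losses weight the advantage-scaled log-probability by a clipped importance ratio (stop-gradient on the weight in a real autograd implementation); the clipping range and the `value_coef`/`kl_coef` coefficients here are assumptions, not the post's settings.

```python
import math

def cispo_token_loss(logp: float, logp_old: float, advantage: float,
                     eps_low: float = 0.0, eps_high: float = 0.2) -> float:
    """One-token CISPO-style loss (scalar sketch). The clipped importance
    weight would be wrapped in a stop-gradient in a real implementation."""
    ratio = math.exp(logp - logp_old)          # pi_new / pi_old for this token
    w = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)  # clipped IS weight
    return -(w * advantage * logp)

def total_loss(policy_loss: float, value_loss: float, kl: float,
               value_coef: float = 0.5, kl_coef: float = 0.01) -> float:
    # Total objective as described: policy (CISPO) + value + KL terms.
    return policy_loss + value_coef * value_loss + kl_coef * kl
```

Unlike PPO's clipped surrogate, CISPO clips only the importance weight rather than the whole objective, so gradients still flow through every token's log-probability.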