Direct Preference Optimization vs. RLHF
- #AI
- #Machine Learning
- #Fine-Tuning
- The Together Fine-Tuning Platform now supports Direct Preference Optimization (DPO).
- DPO aligns language models with human preferences for more helpful, accurate, and tailored AI assistants.
- Modern language model development involves pre-training, supervised fine-tuning (SFT), and preference-based learning.
- DPO is an alternative to Reinforcement Learning from Human Feedback (RLHF).
- DPO trains models directly on pairs of preferred and non-preferred responses, without using reinforcement learning (see the preference-data sketch after this list).
- DPO adjusts model weights to increase the probability of preferred responses and decrease that of non-preferred ones.
- DPO is simpler and more efficient than RLHF because it avoids training a separate reward model (a minimal loss sketch follows this list).
- Combining SFT with DPO, typically SFT first and then DPO on preference data, creates a more effective training pipeline than either step alone.
- DPO is ideal when prompting alone isn't sufficient, when humans can compare responses more easily than they can write them, and when controlled, incremental improvements are needed.
- DPO excels at tasks with nuanced quality judgments, but is less suited to tasks with a single correct answer.
- The key DPO hyperparameter is --dpo-beta, which controls how far the fine-tuned model can deviate from the reference model: higher values keep it closer to the reference, lower values allow larger deviations (see the loss sketch after this list).
- Monitoring DPO training involves metrics such as preference accuracy (how often the model ranks the preferred response above the non-preferred one) and KL divergence from the reference model (see the metrics sketch after this list).
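
Preference data for DPO pairs a prompt with one preferred and one non-preferred response. Below is a minimal sketch of such records written as JSONL; the field names ("prompt", "chosen", "rejected") are illustrative assumptions, not necessarily the exact schema the Together Fine-Tuning Platform expects.

```python
import json

# Hypothetical preference-pair records; the field names are illustrative
# assumptions, not the platform's exact schema.
examples = [
    {
        "prompt": "Summarize the quarterly report in two sentences.",
        "chosen": "Revenue grew 12% year over year, driven by the enterprise segment; "
                  "operating costs stayed flat, lifting margins to 21%.",
        "rejected": "The report talks about revenue and costs and some other things "
                    "that happened during the quarter.",
    },
]

# Write one JSON object per line, the usual JSONL layout for fine-tuning data.
with open("preference_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```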
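The following is a minimal sketch of the DPO objective in PyTorch, assuming the per-sequence log-probabilities of each chosen and rejected response under the trained policy and a frozen reference model have already been computed. Here `beta` plays the role of --dpo-beta: it scales the log-ratio margin, so larger values penalize deviation from the reference model more strongly.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (sketch).

    Each argument is a 1-D tensor of per-sequence log-probabilities, one
    entry per preference pair. `beta` controls how strongly the policy is
    kept close to the reference model.
    """
    # Log-ratios of policy vs. reference for chosen and rejected responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # The loss pushes the chosen log-ratio above the rejected one, i.e. it
    # raises the probability of preferred responses and lowers that of
    # non-preferred ones relative to the reference model.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy example with random log-probabilities for four preference pairs.
torch.manual_seed(0)
policy_chosen = torch.randn(4) - 10.0
policy_rejected = torch.randn(4) - 11.0
ref_chosen = torch.randn(4) - 10.0
ref_rejected = torch.randn(4) - 10.0
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```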
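And a sketch of two monitoring quantities computed from the same inputs: preference accuracy (how often the model's implicit reward ranks the chosen response above the rejected one) and a crude log-ratio drift proxy for divergence from the reference model. These are illustrative definitions, not necessarily the exact metrics the platform reports.

```python
import torch

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Rough DPO training metrics (sketch, not the platform's exact definitions)."""
    # Implicit rewards: scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Preference accuracy: fraction of pairs where the implicit reward
    # ranks the preferred response above the non-preferred one.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()

    # Crude drift proxy: average log-ratio on the training responses.
    # Large drift suggests the policy is moving far from the reference.
    drift = torch.cat([
        policy_chosen_logps - ref_chosen_logps,
        policy_rejected_logps - ref_rejected_logps,
    ]).mean()

    return {"preference_accuracy": accuracy.item(),
            "mean_logratio_drift": drift.item()}
```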