Direct Preference Optimization vs. RLHF
- #AI
- #Machine Learning
- #Fine-Tuning
- The Together Fine-Tuning Platform now supports Direct Preference Optimization (DPO).
- DPO aligns language models with human preferences for more helpful, accurate, and tailored AI assistants.
- Modern language model development involves pre-training, supervised fine-tuning (SFT), and preference-based learning.
- DPO is an alternative to Reinforcement Learning from Human Feedback (RLHF).
- DPO trains models directly on pairs of preferred and non-preferred responses, without using reinforcement learning (see the preference-data sketch after this list).
- DPO adjusts model weights to increase the probability of preferred responses and decrease that of non-preferred ones.
- DPO is simpler and more efficient than RLHF because it avoids training a separate reward model (a minimal loss sketch follows this list).
- Combining SFT with DPO, typically SFT first and then DPO on preference data, creates a more effective training pipeline than either step alone.
- DPO is ideal when prompting alone isn't sufficient, when humans can compare responses more easily than they can write them, and when controlled, incremental improvements are needed.
- DPO excels at tasks with nuanced quality judgments, but is less suited to tasks with a single correct answer.
- The key DPO hyperparameter is --dpo-beta, which controls how far the fine-tuned model can deviate from the reference model: higher values keep it closer to the reference, lower values allow larger deviations (see the loss sketch after this list).
- Monitoring DPO training involves metrics such as preference accuracy (how often the model ranks the preferred response above the non-preferred one) and KL divergence from the reference model (see the metrics sketch after this list).
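
Preference data for DPO pairs a prompt with one preferred and one non-preferred response. Below is a minimal sketch of such records written as JSONL; the field names ("prompt", "chosen", "rejected") are illustrative assumptions, not necessarily the exact schema the Together Fine-Tuning Platform expects.

```python
import json

# Hypothetical preference-pair records; the field names are illustrative
# assumptions, not the platform's exact schema.
examples = [
    {
        "prompt": "Summarize the quarterly report in two sentences.",
        "chosen": "Revenue grew 12% year over year, driven by the enterprise segment; "
                  "operating costs stayed flat, lifting margins to 21%.",
        "rejected": "The report talks about revenue and costs and some other things "
                    "that happened during the quarter.",
    },
]

# Write one JSON object per line, the usual JSONL layout for fine-tuning data.
with open("preference_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```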
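The following is a minimal sketch of the DPO objective in PyTorch, assuming the per-sequence log-probabilities of each chosen and rejected response under the trained policy and a frozen reference model have already been computed. Here `beta` plays the role of --dpo-beta: it scales the log-ratio margin, so larger values penalize deviation from the reference model more strongly.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (sketch).

    Each argument is a 1-D tensor of per-sequence log-probabilities, one
    entry per preference pair. `beta` controls how strongly the policy is
    kept close to the reference model.
    """
    # Log-ratios of policy vs. reference for chosen and rejected responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # The loss pushes the chosen log-ratio above the rejected one, i.e. it
    # raises the probability of preferred responses and lowers that of
    # non-preferred ones relative to the reference model.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy example with random log-probabilities for four preference pairs.
torch.manual_seed(0)
policy_chosen = torch.randn(4) - 10.0
policy_rejected = torch.randn(4) - 11.0
ref_chosen = torch.randn(4) - 10.0
ref_rejected = torch.randn(4) - 10.0
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```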
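And a sketch of two monitoring quantities computed from the same inputs: preference accuracy (how often the model's implicit reward ranks the chosen response above the rejected one) and a crude log-ratio drift proxy for divergence from the reference model. These are illustrative definitions, not necessarily the exact metrics the platform reports.

```python
import torch

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Rough DPO training metrics (sketch, not the platform's exact definitions)."""
    # Implicit rewards: scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Preference accuracy: fraction of pairs where the implicit reward
    # ranks the preferred response above the non-preferred one.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()

    # Crude drift proxy: average log-ratio on the training responses.
    # Large drift suggests the policy is moving far from the reference.
    drift = torch.cat([
        policy_chosen_logps - ref_chosen_logps,
        policy_rejected_logps - ref_rejected_logps,
    ]).mean()

    return {"preference_accuracy": accuracy.item(),
            "mean_logratio_drift": drift.item()}
```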