Supervised Fine Tuning on Curated Data Is Reinforcement Learning
- #Machine Learning
- #Supervised Fine-Tuning
- #Reinforcement Learning
- Behavior cloning (BC) on curated data is the dominant method both for supervised fine-tuning (SFT) of large language models and for imitation learning.
- SFT can be viewed as maximizing a lower bound on the reinforcement learning (RL) objective in a sparse-reward setting.
- A modified version of SFT, importance-weighted supervised fine-tuning (iw-SFT), optimizes a tighter bound on the RL objective and can improve performance.
- iw-SFT is easy to implement and can be generalized to training with quality scored data.
- These SFT variants are competitive with advanced RL algorithms for large language models and for continuous control tasks, with iw-SFT achieving 66.7% on the AIME 2024 benchmark.
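The core change from SFT to iw-SFT can be sketched as follows: reweight each curated example's negative log-likelihood by an importance weight comparing the current policy to the reference policy that generated the data. This is a minimal illustrative sketch, not the paper's exact recipe; the function name, per-sequence weighting granularity, and the clipping constant are assumptions.

```python
import math

def iw_sft_loss(logp_theta, logp_ref, clip_max=10.0):
    """Importance-weighted SFT loss over a batch of curated sequences.

    logp_theta: per-sequence log-probabilities under the current policy
    logp_ref:   per-sequence log-probabilities under the reference
                (data-generating) policy
    """
    # Importance weight w = pi_theta(x) / pi_ref(x), clipped for stability
    # (clip value is an assumption). In training, w is treated as a constant
    # (no gradient flows through it), so it simply rescales the SFT loss.
    weights = [min(math.exp(lt - lr), clip_max)
               for lt, lr in zip(logp_theta, logp_ref)]
    # Weighted negative log-likelihood: plain SFT recovers w = 1 everywhere.
    return -sum(w * lt for w, lt in zip(weights, logp_theta)) / len(logp_theta)
```

With all weights equal to 1 this reduces to the standard SFT loss, which is why iw-SFT is easy to retrofit onto an existing SFT pipeline.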