Supervised Fine Tuning on Curated Data Is Reinforcement Learning
- #Machine Learning
- #Supervised Fine-Tuning
- #Reinforcement Learning
- Behavior cloning (BC) on curated data is the dominant method both for supervised fine-tuning (SFT) of large language models and for imitation learning.
- SFT can be viewed as maximizing a lower bound on the reinforcement learning (RL) objective in a sparse-reward setting.
- A modified version of SFT, importance-weighted supervised fine-tuning (iw-SFT), optimizes a tighter bound on the RL objective and can improve performance.
- iw-SFT is easy to implement and can be generalized to training with quality scored data.
- These SFT variants are competitive with advanced RL algorithms for large language models and for continuous control tasks, with iw-SFT achieving 66.7% on the AIME 2024 benchmark.
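The core change from SFT to iw-SFT can be sketched as follows: reweight each curated example's negative log-likelihood by an importance weight comparing the current policy to the reference policy that generated the data. This is a minimal illustrative sketch, not the paper's exact recipe; the function name, per-sequence weighting granularity, and the clipping constant are assumptions.

```python
import math

def iw_sft_loss(logp_theta, logp_ref, clip_max=10.0):
    """Importance-weighted SFT loss over a batch of curated sequences.

    logp_theta: per-sequence log-probabilities under the current policy
    logp_ref:   per-sequence log-probabilities under the reference
                (data-generating) policy
    """
    # Importance weight w = pi_theta(x) / pi_ref(x), clipped for stability
    # (clip value is an assumption). In training, w is treated as a constant
    # (no gradient flows through it), so it simply rescales the SFT loss.
    weights = [min(math.exp(lt - lr), clip_max)
               for lt, lr in zip(logp_theta, logp_ref)]
    # Weighted negative log-likelihood: plain SFT recovers w = 1 everywhere.
    return -sum(w * lt for w, lt in zip(weights, logp_theta)) / len(logp_theta)
```

With all weights equal to 1 this reduces to the standard SFT loss, which is why iw-SFT is easy to retrofit onto an existing SFT pipeline.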