Hasty Briefs (beta)
Supervised Fine Tuning on Curated Data Is Reinforcement Learning

9 months ago
  • #Machine Learning
  • #Supervised Fine-Tuning
  • #Reinforcement Learning
  • Behavior cloning (BC) on curated data is the dominant method both for supervised fine-tuning (SFT) of large language models and for imitation learning.
  • SFT can be viewed as maximizing a lower bound on the Reinforcement Learning (RL) objective in a sparse reward setting.
  • A modified version of SFT, called importance-weighted supervised fine-tuning (iw-SFT), optimizes a tighter bound on the RL objective and can improve performance.
  • iw-SFT is easy to implement and can be generalized to training with quality scored data.
  • These SFT variants are competitive with state-of-the-art RL algorithms both for large language models and on continuous control tasks; for example, iw-SFT achieves 66.7% on AIME 2024.
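
Since iw-SFT is described as easy to implement, the core idea can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: it reweights the standard SFT negative log-likelihood of a sequence by an importance weight between the current policy and the reference (data-generating) policy, with a clip threshold (`clip` here is an assumed stabilization choice, not a value from the source).

```python
import math

def iw_sft_loss(logp_theta, logp_ref, clip=5.0):
    """Toy importance-weighted SFT loss for a single sequence.

    logp_theta: log-probability of the sequence under the current policy.
    logp_ref:   log-probability under the reference policy that produced
                the curated data.
    The importance weight w = pi_theta / pi_ref is clipped at `clip`
    (an illustrative choice) to keep the estimate stable.
    """
    w = math.exp(logp_theta - logp_ref)  # importance weight pi_theta / pi_ref
    w = min(w, clip)                     # clip large weights for stability
    # In a real training loop the weight would be treated as a constant
    # (stop-gradient); here it simply rescales the usual SFT loss term.
    return -w * logp_theta
```

When the current policy matches the reference (`logp_theta == logp_ref`), the weight is 1 and the loss reduces to plain SFT; sequences the current policy already favors relative to the reference are upweighted, which is what tightens the bound on the RL objective.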