RL is more information inefficient than you thought

  • #Machine Learning Efficiency
  • #Reinforcement Learning
  • #Supervised Learning
  • RL needs more FLOPs per sample than supervised learning, because an entire trajectory must be generated to obtain a single reward signal.
  • Supervised learning delivers dense information on every token, whereas RL yields far less information per sample, especially early in training.
  • Information efficiency can be compared via the decomposition Bits/FLOP = Samples/FLOP × Bits/Sample; RL loses mainly on the Bits/Sample term.
  • Supervised learning maximizes information when the model is uncertain, providing clear feedback on each token.
  • RL struggles early as models are unlikely to guess correctly, leading to inefficient learning from sparse rewards.
  • The pass rate (probability of producing a correct answer) determines information gain: supervised learning extracts the most bits when the pass rate is low, RL when it is moderate (see the first sketch after this list).
  • RL's learning efficiency is therefore poor at low pass rates, improving only once the model is already fairly competent.
  • RL suffers from high variance in its gradient estimates early in training, whereas supervised learning runs into this problem late in training (second sketch below).
  • Curriculum learning and self-play can help RL by keeping the pass rate near the level where each reward is most informative (third sketch below).
  • Proxy objectives like value functions or process-reward models could improve RL's learning efficiency but are hard to develop for LLMs.
  • RL learns fewer but more valuable bits that bear directly on task performance, whereas pretraining learns the broader data manifold.
  • RL can produce jagged performance, excelling at specific tasks while missing generalizable strategies.
  • Human learning is more efficient, drawing on continuous feedback and ongoing world-model updates rather than binary outcomes alone.
  • Active Inference is proposed as a better model of learning, framed around minimizing surprise rather than pursuing explicit goals.
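
Here is a minimal sketch, in information-theoretic terms, of the bits-per-sample gap the bullets describe. It assumes supervised learning reveals the correct answer (worth its surprisal, -log2 p, bits under the model) while RL observes only one binary pass/fail reward (worth at most the Bernoulli entropy H(p) bits); treating the pass rate p as the model's probability of answering correctly is a simplification, not the article's exact accounting.

```python
import numpy as np

def sl_bits_per_label(p):
    # Supervised learning reveals the correct answer outright, so one
    # labelled example is worth its surprisal under the model: -log2(p).
    # The less likely the model was to get it right, the more it learns.
    return -np.log2(p)

def rl_bits_per_episode(p):
    # A single binary pass/fail reward is one Bernoulli(p) observation,
    # so it carries at most H(p) bits: near zero when p is tiny or near
    # one, and maximal (1 bit) at p = 0.5.
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in [0.001, 0.01, 0.1, 0.5, 0.9]:
    print(f"pass rate {p:>5}:  SL {sl_bits_per_label(p):6.2f} bits/label | "
          f"RL {rl_bits_per_episode(p):.3f} bits/episode")
```

At a 0.1% pass rate the labelled example is worth roughly 10 bits while the binary reward is worth roughly 0.01 bits; the two only meet near a 50% pass rate, the "moderate" regime where RL learns fastest.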
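Next, a toy illustration (not the article's experiment) of the variance claim: the batch mean of binary rewards, the raw ingredient of a REINFORCE-style advantage, has relative noise of roughly sqrt((1-p)/(B·p)) at pass rate p with batch size B, so the gradient signal is swamped by noise when p is small. The batch size and seed below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_reward_noise(p, batch=64, trials=20_000):
    # Simulate many batches of binary pass/fail rewards and measure how
    # noisy the batch-mean reward is relative to the true pass rate p.
    batch_means = rng.binomial(1, p, size=(trials, batch)).mean(axis=1)
    return batch_means.std() / p

for p in [0.01, 0.1, 0.5, 0.9]:
    print(f"pass rate {p:>4}: noise-to-signal = {relative_reward_noise(p):.2f}")
```

At a 1% pass rate the noise exceeds the signal; at moderate-to-high pass rates it is a small fraction of it, matching the claim that RL's gradient estimates are noisy early and clean once the model is already competent.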
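Finally, a hypothetical sketch of the curriculum idea: keep a running pass-rate estimate per task and preferentially sample tasks whose pass rate sits near the informative middle. The class name, the 0.5 target, and the Beta-style pseudo-counts are illustrative assumptions, not the article's method.

```python
import numpy as np

class PassRateCurriculum:
    """Toy curriculum: prefer tasks whose estimated pass rate is closest
    to a target (default 0.5), where a binary reward is most informative."""

    def __init__(self, n_tasks, target=0.5, seed=0):
        self.target = target
        self.successes = np.ones(n_tasks)      # Beta(1, 1) pseudo-counts
        self.attempts = 2 * np.ones(n_tasks)
        self.rng = np.random.default_rng(seed)

    def sample_task(self):
        # Score each task by how close its estimated pass rate is to the
        # target, then sample a task index proportionally to that score.
        pass_rate = self.successes / self.attempts
        score = 1.0 - np.abs(pass_rate - self.target)
        probs = score / score.sum()
        return self.rng.choice(len(probs), p=probs)

    def update(self, task, passed):
        # Record the binary outcome of one rollout on this task.
        self.attempts[task] += 1
        self.successes[task] += int(passed)

# Usage: sample a task, run a rollout, feed the pass/fail result back.
curriculum = PassRateCurriculum(n_tasks=100)
task = curriculum.sample_task()
curriculum.update(task, passed=True)
```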