RL is more information-inefficient than you thought
- #Machine Learning Efficiency
- #Reinforcement Learning
- #Supervised Learning
- RL requires far more FLOPs per sample than supervised learning, since an entire trajectory must be generated to obtain a single reward signal.
- Supervised learning delivers dense information on every token, while RL extracts far less information per sample, especially early in training.
- Learning efficiency can be compared via Bits/FLOP = (Samples/FLOP) × (Bits/Sample); RL falls badly short on the Bits/Sample factor.
- Supervised learning extracts the most information precisely when the model is uncertain, because every token receives explicit feedback.
- RL struggles early because the model rarely guesses correctly, so the sparse reward carries little to learn from.
- The pass rate (probability of a correct answer) governs information gain: supervised learning benefits from low pass rates, RL from moderate ones (see the bits-per-sample sketch after this list).
- RL's information gain per sample is near zero at low pass rates and only becomes substantial once the model is already fairly competent.
- RL suffers from high variance in its gradient estimates early in training, whereas supervised learning only hits this problem late in training (a numerical check follows this list).
- Curriculum learning and self-play can help RL by keeping the pass rate in the range where rewards are most informative (a toy curriculum sampler is sketched below).
- Proxy objectives like value functions or process-reward models could improve RL's learning efficiency but are hard to develop for LLMs.
- RL learns fewer bits, but more valuable ones that relate directly to task performance, whereas pretraining spreads its bits across the broader data manifold.
- RL can lead to jagged performance, excelling in specific tasks while missing generalizable strategies.
- Human learning is more efficient, leveraging continuous feedback and world model updates beyond binary outcomes.
- Active Inference is proposed as a better model for learning, focusing on minimizing surprise rather than explicit goals.
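
To make the pass-rate argument concrete, here is a minimal sketch (my own illustration, not code from the post) comparing the two quantities the bullets contrast: the surprise −log2(p) a supervised learner gets from being shown the correct answer it assigned probability p, and the at-most-H(p) bits a binary pass/fail reward can carry. The function names are hypothetical.

```python
import numpy as np

def supervised_bits(p):
    """Surprise (in bits) from seeing the correct answer when the model
    assigned it probability p: -log2(p). Grows without bound as p -> 0."""
    return -np.log2(p)

def rl_reward_bits(p):
    """Upper bound on the information carried by a binary pass/fail reward:
    the entropy of a Bernoulli(p) outcome. Peaks at 1 bit when p = 0.5 and
    vanishes as p -> 0 or p -> 1."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in [0.001, 0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"pass rate {p:>6}: supervised ~ {supervised_bits(p):6.2f} bits, "
          f"RL reward <= {rl_reward_bits(p):5.3f} bits")
```

At a 0.1% pass rate the supervised signal is worth about 10 bits while the binary reward carries roughly 0.01 bits; the reward only approaches its 1-bit maximum near a 50% pass rate.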
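
The variance point can be checked numerically as well. Below is a small Monte Carlo sketch of my own, assuming a toy one-parameter Bernoulli "policy" with a binary reward and no baseline (far simpler than an actual LLM trainer), measuring the signal-to-noise ratio of a single-sample REINFORCE gradient estimate at different pass rates.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_snr(p, n=200_000):
    """SNR of a single-sample REINFORCE gradient estimate for a policy
    that answers correctly with probability p = sigmoid(theta)."""
    correct = rng.random(n) < p             # sampled pass/fail outcomes
    reward = correct.astype(float)          # 1 if correct, else 0
    # d/dtheta log pi(a): (1 - p) for a correct answer, -p otherwise
    grad_logp = np.where(correct, 1.0 - p, -p)
    g = reward * grad_logp                  # per-sample gradient estimates
    return g.mean() / g.std()

for p in [0.001, 0.01, 0.1, 0.5, 0.9]:
    print(f"pass rate {p:>5}: gradient SNR ~ {reinforce_snr(p):.3f}")
```

For this toy setup the SNR works out analytically to sqrt(p / (1 − p)), so at a 1% pass rate each sample carries about ten times more noise than signal, which is exactly the early-training regime the post describes.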
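
As for how a curriculum might hold the pass rate in that sweet spot, here is a toy sketch; it is entirely hypothetical (the post describes the idea, not an implementation) and simply tracks per-task pass rates, preferring tasks near a target rate.

```python
import random

class PassRateCurriculum:
    """Toy curriculum sampler: keep an empirical pass rate per task and
    preferentially draw tasks whose pass rate sits near a target where
    a binary reward is most informative (around 0.5)."""

    def __init__(self, task_ids, target=0.5, prior=(1.0, 1.0)):
        self.target = target
        # (successes, failures) per task, initialised with a weak prior
        self.counts = {t: list(prior) for t in task_ids}

    def pass_rate(self, task):
        s, f = self.counts[task]
        return s / (s + f)

    def sample(self, k=1):
        tasks = list(self.counts)
        # Weight each task by how close its estimated pass rate is to the target.
        weights = [1.0 / (1e-3 + abs(self.pass_rate(t) - self.target)) for t in tasks]
        return random.choices(tasks, weights=weights, k=k)

    def update(self, task, passed):
        self.counts[task][0 if passed else 1] += 1


# Usage: train on sampled tasks and feed pass/fail results back in.
curriculum = PassRateCurriculum(task_ids=["easy", "medium", "hard"])
for task in curriculum.sample(k=5):
    passed = task == "easy"   # stand-in for an actual rollout plus answer check
    curriculum.update(task, passed)
```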