RL is more information-inefficient than you thought
- #Machine Learning Efficiency
- #Reinforcement Learning
- #Supervised Learning
- RL requires far more FLOPs per sample than supervised learning, since an entire trajectory must be generated to obtain a single reward signal.
- Supervised learning delivers dense information on every token, while RL extracts far less information per sample, especially early in training.
- Learning efficiency can be compared via Bits/FLOP = (Samples/FLOP) × (Bits/Sample); RL falls badly short on the Bits/Sample factor.
- Supervised learning extracts the most information precisely when the model is uncertain, because every token receives explicit feedback.
- RL struggles early because the model rarely guesses correctly, so the sparse reward carries little to learn from.
- The pass rate (probability of a correct answer) governs information gain: supervised learning benefits from low pass rates, RL from moderate ones (see the bits-per-sample sketch after this list).
- RL's information gain per sample is near zero at low pass rates and only becomes substantial once the model is already fairly competent.
- RL suffers from high variance in its gradient estimates early in training, whereas supervised learning only hits this problem late in training (a numerical check follows this list).
- Curriculum learning and self-play can help RL by keeping the pass rate in the range where rewards are most informative (a toy curriculum sampler is sketched below).
- Proxy objectives like value functions or process-reward models could improve RL's learning efficiency but are hard to develop for LLMs.
- RL learns fewer bits, but more valuable ones that relate directly to task performance, whereas pretraining spreads its bits across the broader data manifold.
- RL can lead to jagged performance, excelling in specific tasks while missing generalizable strategies.
- Human learning is more efficient, leveraging continuous feedback and world model updates beyond binary outcomes.
- Active Inference is proposed as a better model for learning, focusing on minimizing surprise rather than explicit goals.
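
To make the pass-rate argument concrete, here is a minimal sketch (my own illustration, not code from the post) comparing the two quantities the bullets contrast: the surprise −log2(p) a supervised learner gets from being shown the correct answer it assigned probability p, and the at-most-H(p) bits a binary pass/fail reward can carry. The function names are hypothetical.

```python
import numpy as np

def supervised_bits(p):
    """Surprise (in bits) from seeing the correct answer when the model
    assigned it probability p: -log2(p). Grows without bound as p -> 0."""
    return -np.log2(p)

def rl_reward_bits(p):
    """Upper bound on the information carried by a binary pass/fail reward:
    the entropy of a Bernoulli(p) outcome. Peaks at 1 bit when p = 0.5 and
    vanishes as p -> 0 or p -> 1."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in [0.001, 0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"pass rate {p:>6}: supervised ~ {supervised_bits(p):6.2f} bits, "
          f"RL reward <= {rl_reward_bits(p):5.3f} bits")
```

At a 0.1% pass rate the supervised signal is worth about 10 bits while the binary reward carries roughly 0.01 bits; the reward only approaches its 1-bit maximum near a 50% pass rate.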
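
The variance point can be checked numerically as well. Below is a small Monte Carlo sketch of my own, assuming a toy one-parameter Bernoulli "policy" with a binary reward and no baseline (far simpler than an actual LLM trainer), measuring the signal-to-noise ratio of a single-sample REINFORCE gradient estimate at different pass rates.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_snr(p, n=200_000):
    """SNR of a single-sample REINFORCE gradient estimate for a policy
    that answers correctly with probability p = sigmoid(theta)."""
    correct = rng.random(n) < p             # sampled pass/fail outcomes
    reward = correct.astype(float)          # 1 if correct, else 0
    # d/dtheta log pi(a): (1 - p) for a correct answer, -p otherwise
    grad_logp = np.where(correct, 1.0 - p, -p)
    g = reward * grad_logp                  # per-sample gradient estimates
    return g.mean() / g.std()

for p in [0.001, 0.01, 0.1, 0.5, 0.9]:
    print(f"pass rate {p:>5}: gradient SNR ~ {reinforce_snr(p):.3f}")
```

For this toy setup the SNR works out analytically to sqrt(p / (1 − p)), so at a 1% pass rate each sample carries about ten times more noise than signal, which is exactly the early-training regime the post describes.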
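
As for how a curriculum might hold the pass rate in that sweet spot, here is a toy sketch; it is entirely hypothetical (the post describes the idea, not an implementation) and simply tracks per-task pass rates, preferring tasks near a target rate.

```python
import random

class PassRateCurriculum:
    """Toy curriculum sampler: keep an empirical pass rate per task and
    preferentially draw tasks whose pass rate sits near a target where
    a binary reward is most informative (around 0.5)."""

    def __init__(self, task_ids, target=0.5, prior=(1.0, 1.0)):
        self.target = target
        # (successes, failures) per task, initialised with a weak prior
        self.counts = {t: list(prior) for t in task_ids}

    def pass_rate(self, task):
        s, f = self.counts[task]
        return s / (s + f)

    def sample(self, k=1):
        tasks = list(self.counts)
        # Weight each task by how close its estimated pass rate is to the target.
        weights = [1.0 / (1e-3 + abs(self.pass_rate(t) - self.target)) for t in tasks]
        return random.choices(tasks, weights=weights, k=k)

    def update(self, task, passed):
        self.counts[task][0 if passed else 1] += 1


# Usage: train on sampled tasks and feed pass/fail results back in.
curriculum = PassRateCurriculum(task_ids=["easy", "medium", "hard"])
for task in curriculum.sample(k=5):
    passed = task == "easy"   # stand-in for an actual rollout plus answer check
    curriculum.update(task, passed)
```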