LLMs Do Not Predict the Next Word
- #AI Agents
- #Reinforcement Learning
- #LLMs
- LLMs are first pretrained to predict the next token in a sequence, a training target known as the next-token objective (sketched in the first code block after this list).
- Instruction finetuning then adapts LLMs to follow prompts by training them on datasets of instruction-response pairs, improving zero-shot performance on unseen tasks (second sketch below).
- Reinforcement Learning from Human Feedback (RLHF) is a key training step where LLMs are optimized to produce outputs that humans prefer, moving beyond simple next-token prediction.
- RLHF involves two main steps: reward modeling, where a separate model learns to predict which of two candidate outputs a human would prefer, and proximal policy optimization (PPO), which adjusts the LLM to maximize these predicted rewards while staying close to its original behavior (third sketch below).
- LLMs can be viewed as agents that take actions (producing tokens) to maximize rewards, much as chess-playing models choose moves to win games (fourth sketch below).
- The concept of AI agents extends this further by mapping an LLM's token outputs to real-world actions such as tool calls, enhancing its utility beyond text generation (final sketch below).
- Despite these capabilities, LLMs trained with RLHF can produce outputs that score well with the reward model and look good to humans but are actually flawed, a failure mode known as reward hacking.
- The training and capabilities of LLMs suggest they are more than just next-token predictors; they are complex systems optimized for various objectives, including human appeal and task performance.
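
First, the next-token objective. This is a minimal PyTorch sketch, not anyone's actual training code: a toy embedding-plus-linear head stands in for a real LLM, and the token ids are random. The key detail is the shift, so that the prediction at each position is scored against the token that actually follows it.

```python
import torch
import torch.nn.functional as F

# Stand-in "model": an embedding plus a linear head instead of a real LLM.
vocab_size = 100
tokens = torch.randint(0, vocab_size, (2, 16))  # (batch, seq_len) of token ids
embed = torch.nn.Embedding(vocab_size, 32)
head = torch.nn.Linear(32, vocab_size)

logits = head(embed(tokens))                    # (batch, seq_len, vocab_size)

# The next-token objective: the prediction at position t is scored against
# the actual token at position t + 1, so targets are the inputs shifted by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),     # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),                  # the tokens that actually followed
)
loss.backward()
```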
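Second, instruction finetuning. A minimal sketch under the assumption of hand-made token ids (a real pipeline would use a tokenizer): the loss is the same next-token loss as in pretraining, but prompt positions are masked so only the response tokens are scored.

```python
import torch
import torch.nn.functional as F

# Hypothetical token ids for one instruction example; a real pipeline would
# produce these with a tokenizer from text like:
#   prompt:   "Translate to French: Where is the library?"
#   response: "Où est la bibliothèque ?"
prompt_ids = torch.tensor([5, 17, 42, 8])
response_ids = torch.tensor([23, 91, 7, 2])

input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)  # (1, seq_len)

# Same next-token loss, but prompt positions are masked with -100
# (cross_entropy's default ignore_index) so only the response is scored.
labels = torch.cat([torch.full_like(prompt_ids, -100), response_ids]).unsqueeze(0)

vocab_size = 100
logits = torch.randn(1, input_ids.shape[1], vocab_size)  # stand-in for model(input_ids)

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
```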
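Third, the two RLHF steps. The scalar scores and log-probabilities below are toy assumptions; a real setup computes them from model outputs. Step one shows a pairwise (Bradley-Terry style) reward-model loss; step two shows the KL-penalized reward that PPO-style tuning maximizes, which is what keeps the policy close to its original behavior.

```python
import torch
import torch.nn.functional as F

# --- Step 1: reward modeling ---
# Hypothetical scalar scores a reward model assigns to the human-preferred
# and the rejected response for the same prompt.
r_chosen = torch.tensor([1.3])
r_rejected = torch.tensor([0.2])

# Pairwise loss: push the preferred response's score above the rejected one's.
rm_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# --- Step 2: PPO with a KL penalty ---
# Toy per-token log-probs under the policy being tuned and under the frozen
# reference model it started from (real values come from model logits).
logp_policy = torch.tensor([-1.0, -0.8, -1.2])
logp_ref = torch.tensor([-1.1, -1.0, -1.0])

beta = 0.1  # KL coefficient controlling how far the policy may drift
kl_penalty = beta * (logp_policy - logp_ref)

# The quantity the policy is optimized to maximize: the reward model's score
# for the response, minus a penalty for drifting from the original behavior.
reward_model_score = torch.tensor(1.3)
total_reward = reward_model_score - kl_penalty.sum()
```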
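Fourth, the agent view of generation: each sampled token is an action that extends the state, and the reward arrives only once the episode (the full response) is complete. Random logits stand in for a real model here, and the final score is an invented number.

```python
import torch

vocab_size = 100
prompt = [5, 17, 42]   # token ids for the prompt
state = list(prompt)   # the "state" is the text so far

for _ in range(4):
    # A real agent would recompute logits from the full state each step;
    # random logits stand in for model(state) here.
    logits = torch.randn(vocab_size)
    probs = torch.softmax(logits, dim=-1)
    action = torch.multinomial(probs, 1).item()  # the action is a token
    state.append(action)                         # taking it extends the state

# The reward arrives only after the episode ends, when the reward model
# scores the completed response, much as a game is scored after the last move.
episode_reward = 0.87  # hypothetical reward-model score for the full text
```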
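Finally, mapping tokens to real-world actions. This is one common pattern, not a specific library's API: the model is prompted to emit a JSON action as text, and a thin harness parses it and dispatches to a hypothetical tool registry.

```python
import json

def search(query: str) -> str:
    # Stand-in for a real tool, e.g. a web-search API.
    return f"results for {query!r}"

TOOLS = {"search": search}  # hypothetical tool registry

# Suppose the LLM was prompted to emit actions as JSON and generated this:
llm_output = '{"tool": "search", "query": "weather in Paris"}'

action = json.loads(llm_output)                  # tokens -> structured action
result = TOOLS[action["tool"]](action["query"])  # structured action -> real effect
print(result)  # -> results for 'weather in Paris'
```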