LLMs Do Not Predict the Next Word
- #AI Agents
- #Reinforcement Learning
- #LLMs
- LLMs are first pretrained to predict the next token in a sequence, a training target known as the next-token objective (sketched in the first code block after this list).
- Instruction finetuning then adapts LLMs to follow prompts by training them on datasets of instruction-response pairs, improving zero-shot performance on unseen tasks (second sketch below).
- Reinforcement Learning from Human Feedback (RLHF) is a key training step where LLMs are optimized to produce outputs that humans prefer, moving beyond simple next-token prediction.
- RLHF involves two main steps: reward modeling, where a separate model learns to predict which of two candidate outputs a human would prefer, and proximal policy optimization (PPO), which adjusts the LLM to maximize these predicted rewards while staying close to its original behavior (third sketch below).
- LLMs can be viewed as agents that take actions (producing tokens) to maximize rewards, much as chess-playing models choose moves to win games (fourth sketch below).
- The concept of AI agents extends this further by mapping an LLM's token outputs to real-world actions such as tool calls, enhancing its utility beyond text generation (final sketch below).
- Despite these capabilities, LLMs trained with RLHF can produce outputs that score well with the reward model and look good to humans but are actually flawed, a failure mode known as reward hacking.
- The training and capabilities of LLMs suggest they are more than just next-token predictors; they are complex systems optimized for various objectives, including human appeal and task performance.
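
First, the next-token objective. This is a minimal PyTorch sketch, not anyone's actual training code: a toy embedding-plus-linear head stands in for a real LLM, and the token ids are random. The key detail is the shift, so that the prediction at each position is scored against the token that actually follows it.

```python
import torch
import torch.nn.functional as F

# Stand-in "model": an embedding plus a linear head instead of a real LLM.
vocab_size = 100
tokens = torch.randint(0, vocab_size, (2, 16))  # (batch, seq_len) of token ids
embed = torch.nn.Embedding(vocab_size, 32)
head = torch.nn.Linear(32, vocab_size)

logits = head(embed(tokens))                    # (batch, seq_len, vocab_size)

# The next-token objective: the prediction at position t is scored against
# the actual token at position t + 1, so targets are the inputs shifted by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),     # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),                  # the tokens that actually followed
)
loss.backward()
```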
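Second, instruction finetuning. A minimal sketch under the assumption of hand-made token ids (a real pipeline would use a tokenizer): the loss is the same next-token loss as in pretraining, but prompt positions are masked so only the response tokens are scored.

```python
import torch
import torch.nn.functional as F

# Hypothetical token ids for one instruction example; a real pipeline would
# produce these with a tokenizer from text like:
#   prompt:   "Translate to French: Where is the library?"
#   response: "Où est la bibliothèque ?"
prompt_ids = torch.tensor([5, 17, 42, 8])
response_ids = torch.tensor([23, 91, 7, 2])

input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)  # (1, seq_len)

# Same next-token loss, but prompt positions are masked with -100
# (cross_entropy's default ignore_index) so only the response is scored.
labels = torch.cat([torch.full_like(prompt_ids, -100), response_ids]).unsqueeze(0)

vocab_size = 100
logits = torch.randn(1, input_ids.shape[1], vocab_size)  # stand-in for model(input_ids)

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
```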
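Third, the two RLHF steps. The scalar scores and log-probabilities below are toy assumptions; a real setup computes them from model outputs. Step one shows a pairwise (Bradley-Terry style) reward-model loss; step two shows the KL-penalized reward that PPO-style tuning maximizes, which is what keeps the policy close to its original behavior.

```python
import torch
import torch.nn.functional as F

# --- Step 1: reward modeling ---
# Hypothetical scalar scores a reward model assigns to the human-preferred
# and the rejected response for the same prompt.
r_chosen = torch.tensor([1.3])
r_rejected = torch.tensor([0.2])

# Pairwise loss: push the preferred response's score above the rejected one's.
rm_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# --- Step 2: PPO with a KL penalty ---
# Toy per-token log-probs under the policy being tuned and under the frozen
# reference model it started from (real values come from model logits).
logp_policy = torch.tensor([-1.0, -0.8, -1.2])
logp_ref = torch.tensor([-1.1, -1.0, -1.0])

beta = 0.1  # KL coefficient controlling how far the policy may drift
kl_penalty = beta * (logp_policy - logp_ref)

# The quantity the policy is optimized to maximize: the reward model's score
# for the response, minus a penalty for drifting from the original behavior.
reward_model_score = torch.tensor(1.3)
total_reward = reward_model_score - kl_penalty.sum()
```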
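Fourth, the agent view of generation: each sampled token is an action that extends the state, and the reward arrives only once the episode (the full response) is complete. Random logits stand in for a real model here, and the final score is an invented number.

```python
import torch

vocab_size = 100
prompt = [5, 17, 42]   # token ids for the prompt
state = list(prompt)   # the "state" is the text so far

for _ in range(4):
    # A real agent would recompute logits from the full state each step;
    # random logits stand in for model(state) here.
    logits = torch.randn(vocab_size)
    probs = torch.softmax(logits, dim=-1)
    action = torch.multinomial(probs, 1).item()  # the action is a token
    state.append(action)                         # taking it extends the state

# The reward arrives only after the episode ends, when the reward model
# scores the completed response, much as a game is scored after the last move.
episode_reward = 0.87  # hypothetical reward-model score for the full text
```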
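Finally, mapping tokens to real-world actions. This is one common pattern, not a specific library's API: the model is prompted to emit a JSON action as text, and a thin harness parses it and dispatches to a hypothetical tool registry.

```python
import json

def search(query: str) -> str:
    # Stand-in for a real tool, e.g. a web-search API.
    return f"results for {query!r}"

TOOLS = {"search": search}  # hypothetical tool registry

# Suppose the LLM was prompted to emit actions as JSON and generated this:
llm_output = '{"tool": "search", "query": "weather in Paris"}'

action = json.loads(llm_output)                  # tokens -> structured action
result = TOOLS[action["tool"]](action["query"])  # structured action -> real effect
print(result)  # -> results for 'weather in Paris'
```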