Hasty Briefs (beta)

LLMs Do Not Predict the Next Word

a year ago
  • #AI Agents
  • #Reinforcement Learning
  • #LLMs
  • LLMs are initially pretrained to predict the next token in a sequence, an objective known as next-token prediction.
  • Instruction finetuning adapts LLMs to follow prompts by training them on datasets of instruction-response pairs, improving their zero-shot performance on new tasks.
  • Reinforcement Learning from Human Feedback (RLHF) is a key training step where LLMs are optimized to produce outputs that humans prefer, moving beyond simple next-token prediction.
  • RLHF involves two main steps: reward modeling, where a model learns to predict human preferences, and proximal policy optimization (PPO), which adjusts the LLM to maximize these rewards while staying close to its original behavior.
  • LLMs can be viewed as agents that take actions (producing tokens) to maximize rewards, similar to how chess-playing models choose moves to win games.
  • The concept of AI agents extends LLMs by mapping their token outputs to real-world actions, enhancing their utility beyond text generation.
  • Despite their capabilities, LLMs trained with RLHF can sometimes produce outputs that seem good to humans but are actually flawed, a phenomenon known as reward hacking.
  • The training and capabilities of LLMs suggest they are more than just next-token predictors; they are complex systems optimized for various objectives, including human appeal and task performance.
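The next-token objective in the first bullet is just cross-entropy on the true continuation. A minimal sketch (the toy vocabulary and probabilities here are invented for illustration):

```python
import math

def next_token_loss(probs: dict[str, float], target: str) -> float:
    """Cross-entropy for one prediction: -log p(target next token)."""
    return -math.log(probs[target])

# Toy model output: a distribution over a tiny vocabulary,
# e.g. after the prefix "the cat sat on the".
probs = {"mat": 0.7, "hat": 0.2, "dog": 0.1}

loss = next_token_loss(probs, "mat")
print(round(loss, 3))  # -ln(0.7) ≈ 0.357
```

Pretraining minimizes this loss averaged over every position in a large text corpus.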
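The reward-modeling step of RLHF is typically trained with a pairwise (Bradley-Terry) loss over human preference data: the chosen response should score higher than the rejected one. A sketch with invented scores:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: pushes the chosen response's reward-model
    score above the rejected response's score."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Scores a reward model might assign to two candidate responses.
print(round(pairwise_loss(2.0, 0.5), 3))  # small loss: ranking already correct
print(round(pairwise_loss(0.5, 2.0), 3))  # large loss: ranking is wrong
```

Minimizing this loss teaches the reward model to predict which of two outputs a human would prefer.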
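The "maximize rewards while staying close to its original behavior" part of PPO is usually implemented as a KL penalty against the frozen reference model. A toy per-token version (the numbers and the coefficient `beta` are illustrative, not from the article):

```python
def kl_penalized_reward(reward: float,
                        logp_policy: float,
                        logp_reference: float,
                        beta: float = 0.1) -> float:
    """RLHF objective per token: reward-model score minus a KL penalty
    that keeps the tuned policy close to the reference model."""
    kl_estimate = logp_policy - logp_reference  # per-token KL estimate
    return reward - beta * kl_estimate

# If the policy assigns much higher log-prob to its own output than the
# reference does, the penalty eats into the reward.
print(kl_penalized_reward(reward=1.0, logp_policy=-1.0, logp_reference=-3.0))  # 0.8
```

The penalty is one defense against reward hacking: drifting far from the reference model to chase reward-model quirks gets progressively more expensive.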
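The agent idea in the bullets above, mapping token outputs to real-world actions, can be sketched as a dispatcher that routes specially formatted model output to tools. The `CALC:`/`ECHO:` convention and the tools here are invented for illustration:

```python
def dispatch(llm_output: str) -> str:
    """Map a model's token output to an action; plain text passes through."""
    tools = {
        "CALC": lambda expr: str(eval(expr)),  # toy calculator (unsafe outside a demo)
        "ECHO": lambda text: text,
    }
    name, sep, arg = llm_output.partition(":")
    if sep and name in tools:
        return tools[name](arg.strip())
    return llm_output  # no tool prefix: ordinary text generation

print(dispatch("CALC: 2 + 3"))   # "5"
print(dispatch("hello world"))   # "hello world"
```

Real agent frameworks use structured function calls rather than string prefixes, but the principle is the same: tokens become actions.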