
Writing an LLM from scratch, part 20 – starting training, and cross entropy loss

11 hours ago
  • #LLM Training
  • #Cross Entropy Loss
  • #Information Theory
  • Training an LLM involves using gradient descent to minimize a loss function that measures the inaccuracy of the model's predictions.
  • The loss function used is cross entropy loss, which compares the model's predicted probabilities against the actual target tokens in the training data.
  • Cross entropy loss simplifies when training targets are one-hot vectors, reducing to the negative log probability of the correct token; a sketch of this reduction appears after this list.
  • Entropy in information theory measures the 'messiness' or uncertainty of a probability distribution, with higher entropy indicating more uncertainty.
  • Cross entropy extends this concept by comparing the model's predicted distribution against the true distribution, providing a measure of how well the model's predictions match reality (the standard definitions are sketched below).
  • The training process involves averaging the cross entropy loss across all prefix sequence/target pairs in a batch to guide the model's parameter updates; a PyTorch sketch of this averaging follows the list.
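
For reference, the standard information-theoretic definitions behind these bullets can be sketched as follows (the notation here is this summary's own, not necessarily the post's):

```latex
% Entropy of a distribution p: the expected "surprise" over its outcomes.
H(p) = -\sum_i p_i \log p_i

% Cross entropy of the model's predicted distribution q measured
% against the true distribution p.
H(p, q) = -\sum_i p_i \log q_i

% When p is a one-hot target (p_t = 1 for the correct token t), the sum
% collapses to the negative log probability assigned to that token.
H(p, q) = -\log q_t
```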
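
A minimal PyTorch sketch of the one-hot reduction, assuming the standard torch.nn.functional.cross_entropy API; the toy logits and target ids are made up for illustration and are not from the original post:

```python
import torch
import torch.nn.functional as F

# Toy logits for two predictions over a five-token vocabulary, plus the
# target token id for each prediction.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.3],
                       [0.2, 1.5,  0.0, 0.7, -0.5]])
targets = torch.tensor([0, 1])

# With one-hot targets, cross entropy reduces to the negative log
# probability the model assigned to the correct token, averaged here
# over the two predictions.
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -log_probs[torch.arange(len(targets)), targets].mean()

# PyTorch's built-in cross_entropy computes the same value from raw logits.
builtin_loss = F.cross_entropy(logits, targets)

print(manual_loss.item(), builtin_loss.item())  # the two values match
```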
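
And a sketch of how the per-pair losses are averaged over a whole training batch; the shapes below (batch_size, seq_len, vocab_size) are hypothetical and chosen only for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical LLM training batch: the model emits a next-token
# distribution for every prefix position in every sequence.
batch_size, seq_len, vocab_size = 4, 8, 50257
logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# Flatten so that every prefix/target pair becomes one row, then let
# cross_entropy average the per-pair losses into a single scalar.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# That scalar is what gradient descent minimizes during training.
loss.backward()
```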