Writing an LLM from scratch, part 20 – starting training, and cross entropy loss
11 hours ago
- #LLM Training
- #Cross Entropy Loss
- #Information Theory
- Training an LLM involves using gradient descent to minimize a loss function that measures the inaccuracy of the model's predictions.
- The loss function used is cross entropy loss, which compares the model's predicted probabilities against the actual target tokens in the training data.
- Cross entropy loss simplifies when the training targets are one-hot vectors, reducing to the negative log probability of the correct token (a short worked example follows this list).
- Entropy in information theory measures the 'messiness', or uncertainty, of a probability distribution: the more spread out the distribution, the higher its entropy.
- Cross entropy extends this concept by comparing the model's predicted distribution against the true distribution, giving a measure of how closely the model's predictions match reality (illustrated in the second sketch below).
- The training process averages the cross entropy loss over all prefix sequence/target pairs in a batch, and that average loss drives the model's parameter updates (see the batched PyTorch sketch at the end).
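To make the one-hot simplification concrete, here is a minimal sketch of my own (not code from the post; the logits and target index are invented) showing that the full cross entropy sum against a one-hot target collapses to the negative log probability of the correct token:

```python
import torch
import torch.nn.functional as F

vocab_size = 5
logits = torch.tensor([2.0, -1.0, 0.5, 0.1, -3.0])  # raw model outputs for one position
probs = torch.softmax(logits, dim=-1)                # predicted distribution over the vocab

target = 2                                           # index of the correct next token
one_hot = F.one_hot(torch.tensor(target), vocab_size).float()

# Full cross entropy: -sum_i p_true(i) * log(p_model(i))
full_ce = -(one_hot * torch.log(probs)).sum()

# The one-hot vector is zero everywhere except the target, so only one term survives:
neg_log_prob = -torch.log(probs[target])

print(full_ce.item(), neg_log_prob.item())           # identical values
```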
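And a small illustration of entropy as 'messiness' (again my own sketch, with made-up distributions): a peaked distribution has low entropy, a uniform one has high entropy, and cross entropy against the true distribution is small only when the model puts its probability mass in the right place:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

peaked  = [0.97, 0.01, 0.01, 0.01]  # nearly certain prediction
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain prediction

print(entropy(peaked))    # ~0.17 nats (low uncertainty)
print(entropy(uniform))   # ~1.39 nats (high uncertainty)

true_dist = [1.0, 0.0, 0.0, 0.0]          # the actual next token is index 0
print(cross_entropy(true_dist, peaked))   # ~0.03: mass was on the right token
print(cross_entropy(true_dist, uniform))  # ~1.39: mass spread everywhere
```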
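Finally, a hedged sketch of how the batched loss is commonly computed in PyTorch (the shapes and random tensors below are placeholders, not the series' actual training code): `F.cross_entropy` applies softmax internally, takes the negative log probability of each target token, and averages over every prefix/target pair in the batch; gradient descent then follows the gradients of that average.

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 4, 50257     # GPT-2-sized vocabulary, tiny batch

# Stand-ins for the model's output logits and the training targets.
logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# Flatten so every prefix sequence/target pair contributes one term to the average.
loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch * seq_len, vocab_size)
    targets.view(-1),             # (batch * seq_len,)
)                                 # reduction="mean" averages across all pairs

loss.backward()                   # gradients that a gradient-descent step would use
print(loss.item())
```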