Writing an LLM from scratch, part 20 – starting training, and cross entropy loss
11 hours ago
- #LLM Training
- #Cross Entropy Loss
- #Information Theory
- Training an LLM involves using gradient descent to minimize a loss function that measures the inaccuracy of the model's predictions.
- The loss function used is cross entropy loss, which compares the model's predicted probabilities against the actual target tokens in the training data.
- Cross entropy loss simplifies when the training targets are one-hot vectors, reducing to the negative log probability of the correct token (a short worked example follows this list).
- Entropy in information theory measures the 'messiness', or uncertainty, of a probability distribution: the more spread out the distribution, the higher its entropy.
- Cross entropy extends this concept by comparing the model's predicted distribution against the true distribution, giving a measure of how closely the model's predictions match reality (illustrated in the second sketch below).
- The training process averages the cross entropy loss over all prefix sequence/target pairs in a batch, and that average loss drives the model's parameter updates (see the batched PyTorch sketch at the end).
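To make the one-hot simplification concrete, here is a minimal sketch of my own (not code from the post; the logits and target index are invented) showing that the full cross entropy sum against a one-hot target collapses to the negative log probability of the correct token:

```python
import torch
import torch.nn.functional as F

vocab_size = 5
logits = torch.tensor([2.0, -1.0, 0.5, 0.1, -3.0])  # raw model outputs for one position
probs = torch.softmax(logits, dim=-1)                # predicted distribution over the vocab

target = 2                                           # index of the correct next token
one_hot = F.one_hot(torch.tensor(target), vocab_size).float()

# Full cross entropy: -sum_i p_true(i) * log(p_model(i))
full_ce = -(one_hot * torch.log(probs)).sum()

# The one-hot vector is zero everywhere except the target, so only one term survives:
neg_log_prob = -torch.log(probs[target])

print(full_ce.item(), neg_log_prob.item())           # identical values
```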
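And a small illustration of entropy as 'messiness' (again my own sketch, with made-up distributions): a peaked distribution has low entropy, a uniform one has high entropy, and cross entropy against the true distribution is small only when the model puts its probability mass in the right place:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

peaked  = [0.97, 0.01, 0.01, 0.01]  # nearly certain prediction
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain prediction

print(entropy(peaked))    # ~0.17 nats (low uncertainty)
print(entropy(uniform))   # ~1.39 nats (high uncertainty)

true_dist = [1.0, 0.0, 0.0, 0.0]          # the actual next token is index 0
print(cross_entropy(true_dist, peaked))   # ~0.03: mass was on the right token
print(cross_entropy(true_dist, uniform))  # ~1.39: mass spread everywhere
```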
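Finally, a hedged sketch of how the batched loss is commonly computed in PyTorch (the shapes and random tensors below are placeholders, not the series' actual training code): `F.cross_entropy` applies softmax internally, takes the negative log probability of each target token, and averages over every prefix/target pair in the batch; gradient descent then follows the gradients of that average.

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 4, 50257     # GPT-2-sized vocabulary, tiny batch

# Stand-ins for the model's output logits and the training targets.
logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# Flatten so every prefix sequence/target pair contributes one term to the average.
loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch * seq_len, vocab_size)
    targets.view(-1),             # (batch * seq_len,)
)                                 # reduction="mean" averages across all pairs

loss.backward()                   # gradients that a gradient-descent step would use
print(loss.item())
```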