Hasty Briefs

Writing an LLM from scratch, part 22 – training our LLM

  • #LLM Training
  • #GPT-2
  • #Machine Learning
  • The post concludes the author's notes on chapter 5 of Sebastian Raschka's book "Build a Large Language Model (From Scratch)", focusing on training an LLM from scratch.
  • Highlights include understanding cross-entropy loss and perplexity, and the excitement of seeing the model generate text after training (a worked loss/perplexity sketch follows this list).
  • The author trained the model on a small dataset ('The Verdict' by Edith Wharton) and observed surprisingly coherent outputs.
  • Using pre-trained GPT-2 weights from OpenAI significantly improved the model's output coherence.
  • The post discusses challenges with randomness and seeding when trying to replicate the book's examples exactly.
  • Optimizers like AdamW are introduced, with a brief explanation of their role in training, though the author plans to explore them in more detail later (a minimal AdamW training-loop sketch appears after this list).
  • A notable observation was the large difference in training time between a MacBook Air and an RTX 3090 GPU.
  • The author expresses curiosity about the cost of training a 124M parameter model on personal or rented hardware.
  • Sampling techniques such as temperature scaling and top-k sampling are discussed as ways to reduce verbatim 'memorization' (or 'parroting') of the training text in generated output (see the sampling sketch after this list).
  • The process of downloading OpenAI's GPT-2 weights and loading them into the custom model is covered, with advice on best practices (an alternative weight-loading sketch follows this list).
  • The post ends with anticipation for the next chapter on text classification using the trained model.
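
As a rough illustration of the loss/perplexity relationship mentioned above, the following PyTorch sketch (toy numbers, not taken from the post) computes cross entropy over a couple of next-token predictions and shows that perplexity is just its exponential:

```python
import torch
import torch.nn.functional as F

# Logits for two next-token predictions over a toy 6-token vocabulary,
# plus the token IDs the model should have predicted.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.0, -0.5],
                       [0.2, 1.5, -0.3, 2.2, -1.0, 0.4]])
targets = torch.tensor([0, 3])

# Cross entropy = mean negative log-probability of the correct tokens;
# perplexity is simply exp(cross entropy).
loss = F.cross_entropy(logits, targets)
perplexity = torch.exp(loss)

print(f"cross entropy: {loss.item():.4f}")
print(f"perplexity:    {perplexity.item():.4f}")
```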
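
For the AdamW bullet, here is a minimal, self-contained training-loop sketch. The tiny model and the random token batches are placeholders (the post trains the book's GPT model on 'The Verdict'), so only the shape of the loop and the optimizer call reflect the workflow described:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the GPT model: an embedding plus a linear head over a
# tiny vocabulary. Everything here is a placeholder for illustration.
vocab_size, emb_dim, seq_len, batch_size = 50, 16, 8, 4
model = nn.Sequential(nn.Embedding(vocab_size, emb_dim),
                      nn.Linear(emb_dim, vocab_size))

# AdamW: Adam with decoupled weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

for step in range(5):
    # Random (input, target) token IDs stand in for real batches of text.
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    target_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

    optimizer.zero_grad()
    logits = model(input_ids)                     # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.flatten(0, 1),  # merge batch and sequence dims
                           target_ids.flatten())
    loss.backward()    # compute gradients
    optimizer.step()   # apply the AdamW update
    print(f"step {step}: loss {loss.item():.4f}")
```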
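
For the sampling bullet, this is an illustrative sketch of temperature scaling and top-k filtering applied to a single next-token choice. The function name `sample_next_token` and its defaults are assumptions, and the book's own generate function differs in detail:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 50) -> int:
    """Pick the next token from a 1-D logits vector (illustrative sketch)."""
    # Keep only the top_k highest logits; everything else gets -inf so its
    # softmax probability becomes zero.
    top_logits, _ = torch.topk(logits, top_k)
    logits = torch.where(logits < top_logits[-1],
                         torch.tensor(float("-inf")), logits)

    # Temperature > 1 flattens the distribution (more variety);
    # temperature < 1 sharpens it (closer to greedy decoding).
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Toy usage with a random logits vector over a 100-token vocabulary.
next_id = sample_next_token(torch.randn(100), temperature=0.8, top_k=10)
print(next_id)
```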
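
For the weight-loading bullet, the post follows the book's script for downloading the original OpenAI checkpoint files. As an alternative sketch only, the same 124M-parameter weights can be fetched through the Hugging Face transformers library and then copied tensor by tensor into the from-scratch model:

```python
from transformers import GPT2LMHeadModel

# Fetch the 124M-parameter "gpt2" checkpoint (the weights OpenAI released).
hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
state = hf_model.state_dict()

# Each tensor would then be copied into the matching parameter of the
# from-scratch model; note that GPT-2 stores its projection weights as
# Conv1D layers, i.e. transposed relative to torch.nn.Linear.
print(state["transformer.wte.weight"].shape)  # token embedding: (50257, 768)
```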