Hasty Briefs

Writing an LLM from scratch, part 22 – training our LLM

  • #LLM Training
  • #GPT-2
  • #Machine Learning
  • The post concludes the author's notes on chapter 5 of Sebastian Raschka's book "Build a Large Language Model (From Scratch)", focusing on training an LLM from scratch.
  • Highlights include understanding cross-entropy loss and perplexity, and the excitement of seeing the model generate text after training (a worked loss/perplexity sketch follows this list).
  • The author trained the model on a small dataset ('The Verdict' by Edith Wharton) and observed surprisingly coherent outputs.
  • Using pre-trained GPT-2 weights from OpenAI significantly improved the model's output coherence.
  • The post discusses challenges with randomness and seeding when trying to replicate the book's examples exactly.
  • Optimizers like AdamW are introduced, with a brief explanation of their role in training, though the author plans to explore them in more detail later (a minimal AdamW training-loop sketch appears after this list).
  • A notable observation was the large difference in training time between a MacBook Air and an RTX 3090 GPU.
  • The author expresses curiosity about the cost of training a 124M parameter model on personal or rented hardware.
  • Sampling techniques such as temperature scaling and top-k sampling are discussed as ways to reduce verbatim 'memorization' (or 'parroting') of the training text in generated output (see the sampling sketch after this list).
  • The process of downloading OpenAI's GPT-2 weights and loading them into the custom model is covered, with advice on best practices (an alternative weight-loading sketch follows this list).
  • The post ends with anticipation for the next chapter on text classification using the trained model.
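
As a rough illustration of the loss/perplexity relationship mentioned above, the following PyTorch sketch (toy numbers, not taken from the post) computes cross entropy over a couple of next-token predictions and shows that perplexity is just its exponential:

```python
import torch
import torch.nn.functional as F

# Logits for two next-token predictions over a toy 6-token vocabulary,
# plus the token IDs the model should have predicted.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.0, -0.5],
                       [0.2, 1.5, -0.3, 2.2, -1.0, 0.4]])
targets = torch.tensor([0, 3])

# Cross entropy = mean negative log-probability of the correct tokens;
# perplexity is simply exp(cross entropy).
loss = F.cross_entropy(logits, targets)
perplexity = torch.exp(loss)

print(f"cross entropy: {loss.item():.4f}")
print(f"perplexity:    {perplexity.item():.4f}")
```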
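
For the AdamW bullet, here is a minimal, self-contained training-loop sketch. The tiny model and the random token batches are placeholders (the post trains the book's GPT model on 'The Verdict'), so only the shape of the loop and the optimizer call reflect the workflow described:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the GPT model: an embedding plus a linear head over a
# tiny vocabulary. Everything here is a placeholder for illustration.
vocab_size, emb_dim, seq_len, batch_size = 50, 16, 8, 4
model = nn.Sequential(nn.Embedding(vocab_size, emb_dim),
                      nn.Linear(emb_dim, vocab_size))

# AdamW: Adam with decoupled weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

for step in range(5):
    # Random (input, target) token IDs stand in for real batches of text.
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    target_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

    optimizer.zero_grad()
    logits = model(input_ids)                     # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.flatten(0, 1),  # merge batch and sequence dims
                           target_ids.flatten())
    loss.backward()    # compute gradients
    optimizer.step()   # apply the AdamW update
    print(f"step {step}: loss {loss.item():.4f}")
```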
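
For the sampling bullet, this is an illustrative sketch of temperature scaling and top-k filtering applied to a single next-token choice. The function name `sample_next_token` and its defaults are assumptions, and the book's own generate function differs in detail:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 50) -> int:
    """Pick the next token from a 1-D logits vector (illustrative sketch)."""
    # Keep only the top_k highest logits; everything else gets -inf so its
    # softmax probability becomes zero.
    top_logits, _ = torch.topk(logits, top_k)
    logits = torch.where(logits < top_logits[-1],
                         torch.tensor(float("-inf")), logits)

    # Temperature > 1 flattens the distribution (more variety);
    # temperature < 1 sharpens it (closer to greedy decoding).
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Toy usage with a random logits vector over a 100-token vocabulary.
next_id = sample_next_token(torch.randn(100), temperature=0.8, top_k=10)
print(next_id)
```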
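
For the weight-loading bullet, the post follows the book's script for downloading the original OpenAI checkpoint files. As an alternative sketch only, the same 124M-parameter weights can be fetched through the Hugging Face transformers library and then copied tensor by tensor into the from-scratch model:

```python
from transformers import GPT2LMHeadModel

# Fetch the 124M-parameter "gpt2" checkpoint (the weights OpenAI released).
hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
state = hf_model.state_dict()

# Each tensor would then be copied into the matching parameter of the
# from-scratch model; note that GPT-2 stores its projection weights as
# Conv1D layers, i.e. transposed relative to torch.nn.Linear.
print(state["transformer.wte.weight"].shape)  # token embedding: (50257, 768)
```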