Hasty Briefs (beta)

Writing an LLM from scratch, part 32h – Interventions: full fat float32

6 hours ago
  • #AMP vs float32
  • #precision optimization
  • #LLM training
  • The author tested removing AMP and lower-precision matrix multiplications, training a GPT-2 small base model on code entirely in full-precision float32, to see whether it improves test loss.
  • Removing these optimizations increased training time from 3h24m to over 8 hours and raised server costs from roughly $42 to over $135, since the run needed a machine with more VRAM (8x A100 80 GiB).
  • Test loss improved only slightly, from 3.692 to 3.679, a minimal gain compared with other interventions, and the author concludes that AMP is a huge win for training speed and VRAM efficiency at a negligible cost in quality.
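
The trade-off behind these numbers can be illustrated without a GPU. The sketch below (not the author's benchmark code) uses Python's `struct` module to emulate half-precision and single-precision rounding, showing why lower-precision matrix math like AMP's autocast is so much cheaper while costing only a little accuracy: a dot product accumulated in float16 drifts further from the float64 reference than one accumulated in float32.

```python
import struct

def to_fp16(x: float) -> float:
    # Round a Python float to IEEE 754 half precision and back,
    # exposing the rounding that low-precision matmuls incur.
    return struct.unpack('e', struct.pack('e', x))[0]

def to_fp32(x: float) -> float:
    # Round to single precision ("full-fat float32" training).
    return struct.unpack('f', struct.pack('f', x))[0]

# Dot product of two 1000-element vectors, accumulated at each precision.
a = [0.1] * 1000
b = [0.3] * 1000
exact = sum(x * y for x, y in zip(a, b))  # float64 reference, ~30.0

acc16 = 0.0
acc32 = 0.0
for x, y in zip(a, b):
    acc16 = to_fp16(acc16 + to_fp16(x) * to_fp16(y))
    acc32 = to_fp32(acc32 + to_fp32(x) * to_fp32(y))

print(f"float64 reference:  {exact:.6f}")
print(f"float32 accumulate: {acc32:.6f}  (error {abs(acc32 - exact):.2e})")
print(f"float16 accumulate: {acc16:.6f}  (error {abs(acc16 - exact):.2e})")
```

The float16 error is visibly larger, yet in an actual training run that extra noise barely moved the test loss, which is the crux of the author's conclusion. (Real AMP implementations also mitigate this by accumulating matmul results in float32.)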