Writing an LLM from scratch, part 32h – Interventions: full fat float32
- #AMP vs float32
- #precision optimization
- #LLM training
- The author tested removing AMP (automatic mixed precision) and reduced-precision float32 matrix multiplications, i.e. training entirely in full-precision float32, to see whether it improves test loss for a GPT-2 small base model trained on code.
- Removing these optimizations increased training time from 3h24m to over 8 hours and raised server costs from ~$42 to over $135, partly because a larger-VRAM machine (8x A100 80 GiB) was needed.
- Test loss improved only slightly, from 3.692 to 3.679, a minimal gain compared to other interventions; the author concludes that AMP is a huge win for training speed and VRAM efficiency at negligible quality cost.
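The two regimes being compared can be sketched with standard PyTorch knobs (a hypothetical minimal example, not the author's training script): `torch.set_float32_matmul_precision("highest")` disables reduced-precision matmul shortcuts, while `torch.autocast` is what AMP uses to run eligible ops in a lower compute dtype.

```python
import torch

# Hypothetical sketch of the two regimes compared in the post
# (not the author's code): full float32 vs AMP-style autocast.
model = torch.nn.Linear(16, 4)
x = torch.randn(2, 16)

# Full-precision run: force matmuls to use true float32 precision.
torch.set_float32_matmul_precision("highest")
y_fp32 = model(x)  # computed and returned in float32

# AMP-style run: autocast runs eligible ops in a lower compute dtype.
# (bfloat16 on CPU here; on CUDA it would typically be float16.)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_amp = model(x)  # computed and returned in bfloat16

print(y_fp32.dtype, y_amp.dtype)
```

In a real training loop the AMP path would also wrap the backward pass with a gradient scaler on CUDA; this sketch only shows the dtype difference between the two settings.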
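The "negligible quality cost" of lower precision comes from rounding error in half-precision arithmetic. A generic NumPy illustration (not from the post) of why accumulating in float16 drifts while float32 stays accurate:

```python
import numpy as np

# Sum 10,000 copies of 1e-3; the exact answer is 10.0.
vals = [1e-3] * 10_000

# Accumulate in float16: once the running sum is large enough, the
# float16 spacing (ulp) exceeds the increment and the sum stagnates.
sum16 = np.float16(0.0)
for v in vals:
    sum16 = np.float16(sum16 + np.float16(v))

# Accumulate in float32: easily accurate at this scale.
sum32 = np.float32(0.0)
for v in vals:
    sum32 = np.float32(sum32 + np.float32(v))

print(float(sum16), float(sum32))  # float16 falls far short of 10.0
```

This is the kind of error mixed-precision training accepts in exchange for speed, and why AMP keeps master weights and sensitive reductions in float32.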