Hasty Briefs (beta)

Writing an LLM from scratch, part 32h – Interventions: full fat float32

6 hours ago
  • #AMP vs float32
  • #precision optimization
  • #LLM training
  • The author tested removing AMP and lower-precision matrix multiplications, training a GPT-2 small base model on code entirely in full-precision float32, to see whether it improves test loss.
  • Removing these optimizations increased training time from 3h24m to over 8 hours and raised server costs from roughly $42 to over $135, since the run needed a machine with more VRAM (8x A100 80 GiB).
  • Test loss improved only slightly, from 3.692 to 3.679, a minimal gain compared with other interventions, and the author concludes that AMP is a huge win for training speed and VRAM efficiency at a negligible cost in quality.
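
The trade-off behind these numbers can be illustrated without a GPU. The sketch below (not the author's benchmark code) uses Python's `struct` module to emulate half-precision and single-precision rounding, showing why lower-precision matrix math like AMP's autocast is so much cheaper while costing only a little accuracy: a dot product accumulated in float16 drifts further from the float64 reference than one accumulated in float32.

```python
import struct

def to_fp16(x: float) -> float:
    # Round a Python float to IEEE 754 half precision and back,
    # exposing the rounding that low-precision matmuls incur.
    return struct.unpack('e', struct.pack('e', x))[0]

def to_fp32(x: float) -> float:
    # Round to single precision ("full-fat float32" training).
    return struct.unpack('f', struct.pack('f', x))[0]

# Dot product of two 1000-element vectors, accumulated at each precision.
a = [0.1] * 1000
b = [0.3] * 1000
exact = sum(x * y for x, y in zip(a, b))  # float64 reference, ~30.0

acc16 = 0.0
acc32 = 0.0
for x, y in zip(a, b):
    acc16 = to_fp16(acc16 + to_fp16(x) * to_fp16(y))
    acc32 = to_fp32(acc32 + to_fp32(x) * to_fp32(y))

print(f"float64 reference:  {exact:.6f}")
print(f"float32 accumulate: {acc32:.6f}  (error {abs(acc32 - exact):.2e})")
print(f"float16 accumulate: {acc16:.6f}  (error {abs(acc16 - exact):.2e})")
```

The float16 error is visibly larger, yet in an actual training run that extra noise barely moved the test loss, which is the crux of the author's conclusion. (Real AMP implementations also mitigate this by accumulating matmul results in float32.)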