How to Train an LLM: Part 1
11 days ago
- #LLM Training
- #Deep Learning
- #PyTorch Optimization
- The author shares their journey of training a domain-specific 1B-parameter Llama 3-style model on 8×H100 GPUs.
- Initial setup uses Karpathy's fine-web-edu-shuffled dataset and a 2048-token sequence length for training.
- Model architecture details are provided, including parameters like hidden_size, num_attention_heads, and num_hidden_layers (a configuration sketch follows this list).
- Memory usage estimation is discussed, covering weights, gradients, activations, and optimizer state (a back-of-envelope estimate follows this list).
- Training-process optimizations are explored, such as torch.compile, mixed precision (BF16), and gradient checkpointing (see the training-step sketch below).
- Challenges faced include reproducibility issues with activation memory and convergence problems at 15k steps.
- The author reflects on the iterative process of refining the training infrastructure and the importance of empirical evidence.
- Future plans include improving training efficiency, exploring the AdamW8bit optimizer, and further model optimization (see the AdamW8bit sketch below).
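
The post names the usual Llama-style configuration knobs (hidden_size, num_attention_heads, num_hidden_layers). Here is a minimal sketch of such a configuration using Hugging Face's LlamaConfig; the values are illustrative choices in the ~1B range, not the author's exact settings.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative ~1B Llama 3-style configuration; values are ballpark
# placeholders, not the exact settings from the post.
config = LlamaConfig(
    vocab_size=128256,             # Llama 3 tokenizer size
    hidden_size=2048,
    intermediate_size=8192,
    num_hidden_layers=16,
    num_attention_heads=32,
    num_key_value_heads=8,         # grouped-query attention
    max_position_embeddings=2048,  # matches the 2048-token training sequences
    tie_word_embeddings=True,      # keeps the count near 1B
)

model = LlamaForCausalLM(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e9:.2f}B")
```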
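
For the memory-estimation point, a back-of-envelope calculation assuming BF16 weights and gradients, FP32 Adam state, no sharding, and the per-layer activation approximation from Korthikanti et al. (2022); the batch size and parameter count below are assumptions, not figures from the post.

```python
# Rough memory accounting for BF16 mixed-precision training with Adam-style
# optimizer state in FP32, assuming every GPU holds a full replica.

def static_memory_gb(n_params: float) -> float:
    # BF16 weights (2 B) + BF16 grads (2 B) + FP32 master weights,
    # Adam m and v (4 B each) = 16 bytes per parameter.
    return n_params * 16 / 1e9

def activation_memory_gb(batch: int, seq: int, hidden: int,
                         layers: int, heads: int) -> float:
    # Per-layer transformer activation estimate (Korthikanti et al., 2022):
    # s*b*h*(34 + 5*a*s/h) bytes, 16-bit activations, no recomputation.
    per_layer = seq * batch * hidden * (34 + 5 * heads * seq / hidden)
    return layers * per_layer / 1e9

n_params = 1.2e9  # assumed ~1B-class model
print(f"weights + grads + optimizer: {static_memory_gb(n_params):.1f} GB")
print(f"activations (batch=8, seq=2048): "
      f"{activation_memory_gb(8, 2048, 2048, 16, 32):.1f} GB")
```

Under these assumptions the activation term dominates, which is why gradient checkpointing shows up in the next sketch.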
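
A minimal training-step sketch combining the three optimizations the post mentions (torch.compile, BF16 autocast, gradient checkpointing). The toy config and random batch are placeholders so the snippet runs standalone; this is not the author's training loop.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Toy-sized model and random batch so the sketch runs standalone; a real run
# would use the ~1B config and the tokenized dataset instead.
config = LlamaConfig(vocab_size=32000, hidden_size=256, intermediate_size=1024,
                     num_hidden_layers=4, num_attention_heads=4,
                     max_position_embeddings=2048, use_cache=False)
model = LlamaForCausalLM(config).cuda()

model.gradient_checkpointing_enable()  # recompute activations in the backward pass
model = torch.compile(model)           # graph capture / kernel fusion

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
input_ids = torch.randint(0, config.vocab_size, (4, 2048), device="cuda")

optimizer.zero_grad(set_to_none=True)
# BF16 autocast: matmuls run in bfloat16 while master weights stay FP32,
# so no GradScaler is needed (unlike FP16).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
```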
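
For the planned AdamW8bit experiment, a hedged sketch using bitsandbytes, which stores the Adam moments in 8-bit and cuts optimizer-state memory roughly 4×; the hyperparameters are illustrative, not taken from the post.

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(2048, 2048).cuda()   # stand-in for the real model
# Drop-in replacement for torch.optim.AdamW with 8-bit optimizer state.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4,
                                betas=(0.9, 0.95), weight_decay=0.1)
```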