How to Train an LLM: Part 1
11 days ago
- #LLM Training
- #Deep Learning
- #PyTorch Optimization
- The author shares their journey of training a domain-specific 1B-parameter Llama 3-style model on 8×H100 GPUs.
- Initial setup uses Karpathy's fine-web-edu-shuffled dataset and a 2048-token sequence length for training.
- Model architecture details are provided, including parameters like hidden_size, num_attention_heads, and num_hidden_layers (a configuration sketch follows this list).
- Memory usage estimation is discussed, covering weights, gradients, activations, and optimizer state (a back-of-envelope estimate follows this list).
- Training-process optimizations are explored, such as torch.compile, mixed precision (BF16), and gradient checkpointing (see the training-step sketch below).
- Challenges faced include reproducibility issues with activation memory and convergence problems at 15k steps.
- The author reflects on the iterative process of refining the training infrastructure and the importance of empirical evidence.
- Future plans include improving training efficiency, exploring the AdamW8bit optimizer, and further model optimization (see the AdamW8bit sketch below).
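
The post names the usual Llama-style configuration knobs (hidden_size, num_attention_heads, num_hidden_layers). Here is a minimal sketch of such a configuration using Hugging Face's LlamaConfig; the values are illustrative choices in the ~1B range, not the author's exact settings.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative ~1B Llama 3-style configuration; values are ballpark
# placeholders, not the exact settings from the post.
config = LlamaConfig(
    vocab_size=128256,             # Llama 3 tokenizer size
    hidden_size=2048,
    intermediate_size=8192,
    num_hidden_layers=16,
    num_attention_heads=32,
    num_key_value_heads=8,         # grouped-query attention
    max_position_embeddings=2048,  # matches the 2048-token training sequences
    tie_word_embeddings=True,      # keeps the count near 1B
)

model = LlamaForCausalLM(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e9:.2f}B")
```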
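
For the memory-estimation point, a back-of-envelope calculation assuming BF16 weights and gradients, FP32 Adam state, no sharding, and the per-layer activation approximation from Korthikanti et al. (2022); the batch size and parameter count below are assumptions, not figures from the post.

```python
# Rough memory accounting for BF16 mixed-precision training with Adam-style
# optimizer state in FP32, assuming every GPU holds a full replica.

def static_memory_gb(n_params: float) -> float:
    # BF16 weights (2 B) + BF16 grads (2 B) + FP32 master weights,
    # Adam m and v (4 B each) = 16 bytes per parameter.
    return n_params * 16 / 1e9

def activation_memory_gb(batch: int, seq: int, hidden: int,
                         layers: int, heads: int) -> float:
    # Per-layer transformer activation estimate (Korthikanti et al., 2022):
    # s*b*h*(34 + 5*a*s/h) bytes, 16-bit activations, no recomputation.
    per_layer = seq * batch * hidden * (34 + 5 * heads * seq / hidden)
    return layers * per_layer / 1e9

n_params = 1.2e9  # assumed ~1B-class model
print(f"weights + grads + optimizer: {static_memory_gb(n_params):.1f} GB")
print(f"activations (batch=8, seq=2048): "
      f"{activation_memory_gb(8, 2048, 2048, 16, 32):.1f} GB")
```

Under these assumptions the activation term dominates, which is why gradient checkpointing shows up in the next sketch.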
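
A minimal training-step sketch combining the three optimizations the post mentions (torch.compile, BF16 autocast, gradient checkpointing). The toy config and random batch are placeholders so the snippet runs standalone; this is not the author's training loop.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Toy-sized model and random batch so the sketch runs standalone; a real run
# would use the ~1B config and the tokenized dataset instead.
config = LlamaConfig(vocab_size=32000, hidden_size=256, intermediate_size=1024,
                     num_hidden_layers=4, num_attention_heads=4,
                     max_position_embeddings=2048, use_cache=False)
model = LlamaForCausalLM(config).cuda()

model.gradient_checkpointing_enable()  # recompute activations in the backward pass
model = torch.compile(model)           # graph capture / kernel fusion

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
input_ids = torch.randint(0, config.vocab_size, (4, 2048), device="cuda")

optimizer.zero_grad(set_to_none=True)
# BF16 autocast: matmuls run in bfloat16 while master weights stay FP32,
# so no GradScaler is needed (unlike FP16).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
```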
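
For the planned AdamW8bit experiment, a hedged sketch using bitsandbytes, which stores the Adam moments in 8-bit and cuts optimizer-state memory roughly 4×; the hyperparameters are illustrative, not taken from the post.

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(2048, 2048).cuda()   # stand-in for the real model
# Drop-in replacement for torch.optim.AdamW with 8-bit optimizer state.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4,
                                betas=(0.9, 0.95), weight_decay=0.1)
```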