The annotated PyTorch training loop
3 days ago
- #Common Mistakes
- #PyTorch Training
- #Deep Learning
- Building a PyTorch training loop is straightforward, but operations must be placed carefully: a misplaced call can stall convergence, silently skew results, or make memory usage grow.
- Common mistakes include incorrect placement of model.to(device), optimiser.zero_grad(), clip_grad_norm_(), scheduler.step(), and omitting model.train() or torch.no_grad(), which cause issues like gradient accumulation, ineffective clipping, or memory growth.
- The training loop structure involves data preparation with DataLoader, model setup, loss function, optimiser, and scheduler, followed by a loop for training and validation phases.
- Data pipeline uses Dataset and DataLoader for batching, shuffling, and optional parallel prefetching with num_workers and pin_memory to optimize GPU utilization.
- Model definition via nn.Module requires __init__ and forward methods; model.to(device) should be called before constructing the optimiser, so that the optimiser is built from the parameters as they exist on the target device.
- Training mode (model.train()) enables dropout and batch norm updates, while evaluation mode (model.eval()) disables them; torch.no_grad() prevents graph construction during validation to save memory.
- Optimiser zeroes gradients with zero_grad() before backward pass to prevent accumulation; loss.backward() computes gradients; clip_grad_norm_() clips gradients after backward; optimiser.step() updates weights.
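- Scheduler adjusts the learning rate. Epoch-based schedulers are stepped once per epoch, after optimiser.step(), not once per batch, which would decay the learning rate far too quickly; schedulers designed for per-batch stepping, such as OneCycleLR, are the exception.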
- Checkpointing saves model, optimiser, and scheduler states for resuming training, with best models saved based on validation loss.
- Performance optimizations include non_blocking data transfers, mixed precision training (GradScaler is needed for float16 but generally not for bfloat16), torch.compile for kernel fusion, DataLoader settings like prefetch_factor, and the cudnn.benchmark backend flag.
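
The ordering described in the points above can be sketched as one small loop. This is a minimal illustration, not the article's reference implementation: the `run_training` helper, the synthetic data, the tiny model, the StepLR schedule, and all hyperparameters are assumptions chosen only so the example runs on CPU in a few seconds.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def run_training(epochs=3, checkpoint_path=None):
    """Hypothetical helper showing where each call sits in the loop."""
    torch.manual_seed(0)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Synthetic regression data, standing in for a real Dataset (assumption).
    x = torch.randn(256, 8)
    y = x.sum(dim=1, keepdim=True)
    train_loader = DataLoader(TensorDataset(x[:192], y[:192]), batch_size=32, shuffle=True)
    val_loader = DataLoader(TensorDataset(x[192:], y[192:]), batch_size=32)

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.1), nn.Linear(16, 1))
    model.to(device)  # move the model first, then build the optimiser
    loss_fn = nn.MSELoss()
    optimiser = torch.optim.SGD(model.parameters(), lr=0.05)
    scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=1, gamma=0.9)

    best_val = float("inf")
    history = []
    for epoch in range(epochs):
        model.train()  # enable dropout (and batch-norm updates, if present)
        for xb, yb in train_loader:
            xb = xb.to(device, non_blocking=True)
            yb = yb.to(device, non_blocking=True)
            optimiser.zero_grad()  # clear gradients left over from the last step
            loss = loss_fn(model(xb), yb)
            loss.backward()  # compute gradients
            # Clip after backward (gradients exist) and before step (they get used).
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimiser.step()  # update weights
        scheduler.step()  # once per epoch for an epoch-based scheduler

        model.eval()  # disable dropout for validation
        total, count = 0.0, 0
        with torch.no_grad():  # no autograd graph is built during validation
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                total += loss_fn(model(xb), yb).item() * xb.size(0)
                count += xb.size(0)
        val_loss = total / count
        history.append(val_loss)

        # Save everything needed to resume whenever validation loss improves.
        if val_loss < best_val:
            best_val = val_loss
            if checkpoint_path is not None:
                torch.save(
                    {
                        "epoch": epoch,
                        "model": model.state_dict(),
                        "optimiser": optimiser.state_dict(),
                        "scheduler": scheduler.state_dict(),
                        "val_loss": val_loss,
                    },
                    checkpoint_path,
                )
    return model, history, scheduler.get_last_lr()[0]
```

Calling `run_training(epochs=3, checkpoint_path="best.pt")` would write the best checkpoint to a file named `best.pt` (a placeholder path). Mixed precision and torch.compile are left out on purpose to keep the ordering of the core calls easy to see.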