The annotated PyTorch training loop

3 days ago

#Common Mistakes
#PyTorch Training
#Deep Learning

Building a PyTorch training loop is straightforward, but operations must be placed carefully to avoid silent failures: training that does not converge, results that are subtly wrong, or memory use that keeps growing.
Common mistakes include incorrect placement of model.to(device), optimiser.zero_grad(), clip_grad_norm_(), and scheduler.step(), and omitting model.train() or torch.no_grad(); these cause issues like unintended gradient accumulation, ineffective clipping, or memory growth.
The training loop structure involves data preparation with DataLoader, model setup, loss function, optimiser, and scheduler, followed by a loop for training and validation phases.
The data pipeline uses Dataset and DataLoader for batching and shuffling, with optional parallel loading and prefetching via num_workers and faster host-to-GPU copies via pin_memory, to keep the GPU busy.
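A minimal sketch of such a pipeline, assuming a made-up toy dataset (SquaresDataset is a hypothetical name, not from the article). num_workers is left at 0 so the snippet runs anywhere; in real training it would usually be raised.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset that maps x to x**2."""

    def __init__(self, n: int) -> None:
        self.x = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        self.y = self.x ** 2

    def __len__(self) -> int:
        return len(self.x)

    def __getitem__(self, idx: int):
        return self.x[idx], self.y[idx]


use_cuda = torch.cuda.is_available()
train_loader = DataLoader(
    SquaresDataset(100),
    batch_size=16,
    shuffle=True,         # reshuffle the samples every epoch
    num_workers=0,        # raise above 0 to load batches in worker processes
    pin_memory=use_cuda,  # page-locked host memory only helps CUDA copies
)

# Each iteration yields one batch of inputs and targets.
xb, yb = next(iter(train_loader))
```

With 100 samples and a batch size of 16, the loader yields seven batches per epoch, the last one smaller than the rest.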
Model definition via nn.Module requires __init__ and forward methods; model.to(device) must be called before optimiser construction to avoid referencing outdated parameters.
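The ordering can be sketched as follows. TinyNet and its layer sizes are illustrative assumptions; the point is only that the optimiser is constructed from the parameters after the model has been moved to the target device.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class TinyNet(nn.Module):
    """Hypothetical two-layer network used only for illustration."""

    def __init__(self, in_features: int = 4, hidden: int = 8) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# 1. Build the model and move it to the device first.
model = TinyNet().to(device)

# 2. Only then hand its parameters to the optimiser.
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
```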
Training mode (model.train()) enables dropout and updates batch norm running statistics, while evaluation mode (model.eval()) disables dropout and makes batch norm use its stored running statistics; torch.no_grad() prevents graph construction during validation to save memory.
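A validation pass along these lines might look like the sketch below. The evaluate helper, the small model, and the random validation batches are all assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
loss_fn = nn.MSELoss()


def evaluate(model: nn.Module, batches) -> float:
    """Return the mean per-sample loss over an iterable of (x, y) batches."""
    model.eval()           # dropout off, batch norm uses running statistics
    total, count = 0.0, 0
    with torch.no_grad():  # no autograd graph is built inside this block
        for x, y in batches:
            loss = loss_fn(model(x), y)
            total += loss.item() * x.size(0)
            count += x.size(0)
    return total / count


# Placeholder validation data, random and purely illustrative.
val_batches = [(torch.randn(5, 4), torch.randn(5, 1)) for _ in range(3)]
val_loss = evaluate(model, val_batches)

model.train()  # switch back before the next training epoch
```

Because dropout is disabled in evaluation mode, calling evaluate twice on the same batches gives the same loss. The helper leaves the model in evaluation mode, so the caller is responsible for calling model.train() again.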
Optimiser zeroes gradients with zero_grad() before backward pass to prevent accumulation; loss.backward() computes gradients; clip_grad_norm_() clips gradients after backward; optimiser.step() updates weights.
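The order of operations in a single step can be sketched as below. The linear model, learning rate, and max_norm value are illustrative assumptions, not values taken from the article.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.05)


def train_step(x: torch.Tensor, y: torch.Tensor, max_norm: float = 1.0) -> float:
    model.train()
    optimiser.zero_grad()        # 1. clear gradients left from the last step
    loss = loss_fn(model(x), y)  # 2. forward pass
    loss.backward()              # 3. compute fresh gradients
    # 4. clip only after backward(), when the gradients actually exist
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimiser.step()             # 5. update the weights
    return loss.item()


# Toy regression problem: the target is the sum of the inputs.
x = torch.randn(32, 4)
y = x.sum(dim=1, keepdim=True)
losses = [train_step(x, y) for _ in range(50)]
```

Clipping before backward() would act on stale or missing gradients, and clipping after optimiser.step() would have no effect on the update, which is why it sits between the two.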
Scheduler adjusts learning rate, typically after each epoch, not per batch, to avoid excessive decay.
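As a sketch of per-epoch stepping, the example below uses StepLR with made-up settings (halve the learning rate every two epochs) and random batches. Some schedulers, such as OneCycleLR, are designed to be stepped after every batch instead, so the right placement depends on the scheduler chosen.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=2, gamma=0.5)

batches_per_epoch = 10
lrs = []
for epoch in range(4):
    for _ in range(batches_per_epoch):
        optimiser.zero_grad()
        loss = model(torch.randn(8, 4)).pow(2).mean()
        loss.backward()
        optimiser.step()
    lrs.append(optimiser.param_groups[0]["lr"])  # rate used during this epoch
    scheduler.step()  # once per epoch, outside the batch loop

# lrs is [0.1, 0.1, 0.05, 0.05] up to floating-point rounding
```

Had scheduler.step() been placed inside the batch loop, the rate would have been halved every two batches rather than every two epochs.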
Checkpointing saves model, optimiser, and scheduler states for resuming training, with best models saved based on validation loss.
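One possible shape for this is sketched below. The helper names, the file names last.pt and best.pt, and the validation losses are placeholders chosen for illustration.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 1)
optimiser = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=2, gamma=0.5)


def save_checkpoint(path: str, epoch: int, best_val_loss: float) -> None:
    """Save everything needed to resume training from this point."""
    torch.save(
        {
            "epoch": epoch,
            "best_val_loss": best_val_loss,
            "model": model.state_dict(),
            "optimiser": optimiser.state_dict(),
            "scheduler": scheduler.state_dict(),
        },
        path,
    )


def load_checkpoint(path: str):
    """Restore all three states and return (epoch, best_val_loss)."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimiser.load_state_dict(ckpt["optimiser"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"], ckpt["best_val_loss"]


ckpt_dir = tempfile.mkdtemp()
best_val_loss = float("inf")
for epoch, val_loss in enumerate([0.9, 0.7, 0.8]):  # placeholder losses
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(os.path.join(ckpt_dir, "best.pt"), epoch, best_val_loss)
    save_checkpoint(os.path.join(ckpt_dir, "last.pt"), epoch, best_val_loss)
```

Keeping a separate last.pt means an interrupted run can resume from the most recent epoch, while best.pt always holds the weights with the lowest validation loss seen so far.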
Performance optimizations include non_blocking data transfers, mixed precision training with autocast (plus GradScaler when using float16; bfloat16 generally does not need loss scaling), torch.compile for kernel fusion, DataLoader settings like prefetch_factor, and the torch.backends.cudnn.benchmark flag.

Hasty Briefs (beta)
