Hasty Briefs

Fantastic Pretraining Optimizers and Where to Find Them

5 days ago
  • #optimizers
  • #pretraining
  • #machine-learning
  • AdamW has been the dominant optimizer in language model pretraining despite claims of faster alternatives.
  • Two methodological issues hinder fair comparisons: unequal hyperparameter tuning and limited evaluation setups.
  • The study systematically compares ten optimizers across model scales (0.1B–1.2B parameters) and data-to-model ratios.
  • Fair comparisons require rigorous hyperparameter tuning and evaluations at the end of training.
  • Optimal hyperparameters for one optimizer may be suboptimal for another, making blind transfers unfair.
  • Actual speedup of proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size.
  • Comparing intermediate checkpoints can be misleading because of learning-rate decay effects (see the schedule sketch after this list).
  • The fastest optimizers (Muon, SOAP) use matrix-based rather than elementwise preconditioners, but their speedup shrinks with model scale (a Muon-style update sketch follows below).
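
To see why intermediate-checkpoint comparisons mislead, consider a standard warmup-plus-cosine schedule: at the same step, a run decaying toward a short horizon has already cut its learning rate far below a run targeting a longer horizon, so the two checkpoints reflect different amounts of decay rather than optimizer quality. A minimal Python sketch (the schedule shape and the specific numbers are illustrative assumptions, not the paper's exact setup):

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup=1000):
    """Warmup + cosine-decay schedule (illustrative hyperparameters)."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# At step 50k, a run scheduled for 100k steps is halfway through its decay,
# while a run scheduled for 200k steps is still close to its peak rate:
print(cosine_lr(50_000, total_steps=100_000))  # ~1.7e-4, mid-decay
print(cosine_lr(50_000, total_steps=200_000))  # ~2.6e-4, still high
# Comparing these two checkpoints conflates optimizer quality with how much
# learning-rate decay each run has already received.
```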
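What "matrix-based preconditioners" means in practice: instead of scaling each parameter by a per-coordinate scalar statistic (as AdamW does), optimizers such as Muon treat the gradient of a weight matrix as a matrix and transform it as a whole, for example by approximately orthogonalizing the momentum with a few Newton-Schulz iterations. Below is a minimal NumPy sketch of a Muon-style update; the coefficients follow commonly published reference values, and the whole snippet is an illustration rather than the paper's implementation (it omits details such as shape-dependent scaling).

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5):
    """Approximately orthogonalize a 2-D matrix M via the quintic
    Newton-Schulz iteration used in Muon-style optimizers."""
    a, b, c = 3.4445, -4.7750, 2.0315     # reference coefficients
    X = M / (np.linalg.norm(M) + 1e-7)    # Frobenius-normalize first
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_style_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2-D weight matrix W."""
    momentum = beta * momentum + grad               # plain momentum buffer
    update = newton_schulz_orthogonalize(momentum)  # matrix-level transform
    return W - lr * update, momentum

# Usage on a hypothetical hidden-layer weight of shape (out, in):
W = np.random.randn(256, 512) * 0.02
momentum = np.zeros_like(W)
grad = np.random.randn(*W.shape) * 0.01
W, momentum = muon_style_step(W, grad, momentum)
```

The key contrast with AdamW is that the update direction is computed from the momentum matrix as a whole (its singular-value structure), not from independent per-parameter statistics; this is the sense in which Muon and SOAP "use matrices as preconditioners."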