Fantastic Pretraining Optimizers and Where to Find Them
- #optimizers
- #pretraining
- #machine-learning
- AdamW has been the dominant optimizer in language model pretraining despite claims of faster alternatives.
- Two methodological issues hinder fair comparisons: unequal hyperparameter tuning and limited evaluation setups.
- The paper systematically studies ten optimizers across model scales (0.1B-1.2B parameters) and a range of data-to-model ratios.
- Fair comparisons require rigorous hyperparameter tuning and evaluations at the end of training.
- Optimal hyperparameters for one optimizer can be suboptimal for another, so blindly transferring them from a baseline biases the comparison (see the sweep sketch after this list).
- Actual speedup of proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size.
- Comparing optimizers at intermediate checkpoints can be misleading: learning-rate decay reshapes the loss curves late in training, so mid-run rankings can flip by the end (see the schedule sketch after this list).
- The fastest optimizers (Muon, Soap) precondition updates with matrices rather than per-coordinate scalars, yet even their speedup shrinks as model scale grows (see the Muon-style sketch below).
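
To make the tuning point concrete, here is a minimal sketch of a per-optimizer sweep, assuming a hypothetical `train_and_eval(config)` harness that returns end-of-training validation loss; the search spaces and values below are illustrative assumptions, not the paper's actual grids.

```python
import itertools

# Illustrative per-optimizer search spaces (assumed values, not the paper's grids).
SEARCH_SPACES = {
    "adamw": {"lr": [1e-4, 3e-4, 1e-3], "weight_decay": [0.01, 0.1]},
    "muon":  {"lr": [5e-3, 1e-2, 2e-2], "weight_decay": [0.0, 0.1]},
}

def candidate_configs(optimizer_name: str):
    """Enumerate every combination in this optimizer's own search space."""
    space = SEARCH_SPACES[optimizer_name]
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def tune(optimizer_name: str, train_and_eval):
    """Return the config with the lowest end-of-training loss.
    `train_and_eval(config) -> float` is a hypothetical training harness."""
    return min(candidate_configs(optimizer_name), key=train_and_eval)
```

The point of tuning each optimizer in its own space, rather than reusing AdamW's settings, is exactly the "blind transfer" caveat above: a learning rate that is optimal for AdamW can be far from optimal for Muon.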
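
For the checkpoint caveat, a generic warmup-plus-cosine schedule (an illustrative assumption, not necessarily the paper's exact schedule) shows why a mid-run snapshot is unrepresentative: at intermediate checkpoints the learning rate has not finished annealing, and the loss only settles once it has.

```python
import math

def warmup_cosine_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# At 60% of the budget the learning rate has only partially annealed, so the
# loss is still far from its final value; comparing two optimizers here can
# invert their end-of-training ranking.
print(warmup_cosine_lr(step=6_000, total_steps=10_000, peak_lr=3e-4, warmup_steps=500))
```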
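
For context on "matrices as preconditioners": Muon-style methods treat each 2-D weight as a matrix and orthogonalize its momentum with a Newton-Schulz iteration, instead of scaling each coordinate independently as AdamW does. The sketch below is a simplified single-parameter version; the quintic coefficients are the commonly published ones, while the learning rate and momentum values are assumptions.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to a 2-D tensor `g`."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly published quintic coefficients
    x = g / (g.norm() + 1e-7)           # normalize so the iteration stays stable
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_like_step(weight: torch.Tensor, grad: torch.Tensor,
                   momentum: torch.Tensor, lr: float = 0.02, beta: float = 0.95):
    """One update on plain (non-autograd) tensors: accumulate momentum, then
    replace an elementwise preconditioner with a whole-matrix
    orthogonalization of the momentum."""
    momentum.mul_(beta).add_(grad)
    weight.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
```

Soap takes a different route (Shampoo-style preconditioner matrices), but the shared idea is that the preconditioner couples coordinates within a layer rather than scaling each entry on its own.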