Fantastic Pretraining Optimizers and Where to Find Them
- #optimizers
- #pretraining
- #machine-learning
- AdamW has been the dominant optimizer in language model pretraining despite claims of faster alternatives.
- Two methodological issues hinder fair comparisons: unequal hyperparameter tuning and limited evaluation setups.
- The paper systematically studies ten optimizers across model scales (0.1B-1.2B parameters) and a range of data-to-model ratios.
- Fair comparisons require rigorous hyperparameter tuning and evaluations at the end of training.
- Optimal hyperparameters for one optimizer can be suboptimal for another, so blindly transferring them from a baseline biases the comparison (see the sweep sketch after this list).
- Actual speedup of proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size.
- Comparing optimizers at intermediate checkpoints can be misleading: learning-rate decay reshapes the loss curves late in training, so mid-run rankings can flip by the end (see the schedule sketch after this list).
- The fastest optimizers (Muon, Soap) precondition updates with matrices rather than per-coordinate scalars, yet even their speedup shrinks as model scale grows (see the Muon-style sketch below).
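
To make the tuning point concrete, here is a minimal sketch of a per-optimizer sweep, assuming a hypothetical `train_and_eval(config)` harness that returns end-of-training validation loss; the search spaces and values below are illustrative assumptions, not the paper's actual grids.

```python
import itertools

# Illustrative per-optimizer search spaces (assumed values, not the paper's grids).
SEARCH_SPACES = {
    "adamw": {"lr": [1e-4, 3e-4, 1e-3], "weight_decay": [0.01, 0.1]},
    "muon":  {"lr": [5e-3, 1e-2, 2e-2], "weight_decay": [0.0, 0.1]},
}

def candidate_configs(optimizer_name: str):
    """Enumerate every combination in this optimizer's own search space."""
    space = SEARCH_SPACES[optimizer_name]
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def tune(optimizer_name: str, train_and_eval):
    """Return the config with the lowest end-of-training loss.
    `train_and_eval(config) -> float` is a hypothetical training harness."""
    return min(candidate_configs(optimizer_name), key=train_and_eval)
```

The point of tuning each optimizer in its own space, rather than reusing AdamW's settings, is exactly the "blind transfer" caveat above: a learning rate that is optimal for AdamW can be far from optimal for Muon.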
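
For the checkpoint caveat, a generic warmup-plus-cosine schedule (an illustrative assumption, not necessarily the paper's exact schedule) shows why a mid-run snapshot is unrepresentative: at intermediate checkpoints the learning rate has not finished annealing, and the loss only settles once it has.

```python
import math

def warmup_cosine_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# At 60% of the budget the learning rate has only partially annealed, so the
# loss is still far from its final value; comparing two optimizers here can
# invert their end-of-training ranking.
print(warmup_cosine_lr(step=6_000, total_steps=10_000, peak_lr=3e-4, warmup_steps=500))
```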
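
For context on "matrices as preconditioners": Muon-style methods treat each 2-D weight as a matrix and orthogonalize its momentum with a Newton-Schulz iteration, instead of scaling each coordinate independently as AdamW does. The sketch below is a simplified single-parameter version; the quintic coefficients are the commonly published ones, while the learning rate and momentum values are assumptions.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to a 2-D tensor `g`."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly published quintic coefficients
    x = g / (g.norm() + 1e-7)           # normalize so the iteration stays stable
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_like_step(weight: torch.Tensor, grad: torch.Tensor,
                   momentum: torch.Tensor, lr: float = 0.02, beta: float = 0.95):
    """One update on plain (non-autograd) tensors: accumulate momentum, then
    replace an elementwise preconditioner with a whole-matrix
    orthogonalization of the momentum."""
    momentum.mul_(beta).add_(grad)
    weight.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
```

Soap takes a different route (Shampoo-style preconditioner matrices), but the shared idea is that the preconditioner couples coordinates within a layer rather than scaling each entry on its own.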