Thinking Machines – LoRA Without Regret
- #Parameter-Efficient Fine-Tuning
- #LoRA
- #Machine Learning
- LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes the weight matrices of a large language model and adds a trainable low-rank update, W' = W + γBA, greatly reducing the number of trainable parameters (a minimal layer sketch follows this list).
- LoRA offers practical advantages over full fine-tuning (FullFT): cheaper multi-tenant serving (many adapters can share one base model), a smaller memory footprint during training (fewer optimizer states), and faster loading and transfer of adapters.
- LoRA performs comparably to FullFT in supervised fine-tuning on small-to-medium datasets, but underperforms once the dataset's information content exceeds the adapter's capacity, which scales with rank.
- LoRA is less tolerant of large batch sizes than FullFT: the loss gap grows with batch size and is not closed by increasing the rank.
- Applying LoRA to all layers, especially the MLP/MoE layers, outperforms attention-only LoRA, even when the number of trainable parameters is matched (an example configuration follows this list).
- In reinforcement learning, LoRA matches FullFT performance even with very low ranks (e.g., rank=1), as RL requires less capacity due to limited information per episode.
- Optimal learning rates for LoRA are consistently ~10x higher than for FullFT, and LoRA is also slightly cheaper per training pass, needing roughly 2/3 the FLOPs of FullFT since frozen weights skip the weight-gradient computation (the accounting is sketched after this list).
- Key hyperparameters for LoRA are the rank, the learning rate, and the initialization scales of the A and B matrices; rescaling invariances between these parameters shrink the effective space that needs tuning (one invariance is checked numerically after this list).
- LoRA's performance is similar to FullFT when applied to all layers and when not capacity-constrained, making it suitable for most post-training scenarios.
- Open questions remain about sharpening performance predictions, theoretical understanding of LoRA dynamics, and evaluating LoRA variants like PiSSA.
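A minimal sketch of the low-rank update in PyTorch, assuming the common scaling γ = α/r and zero-initialization of B; the class name `LoRALinear` and all dimensions are illustrative, not the post's reference code:

```python
import torch

class LoRALinear(torch.nn.Module):
    """A LoRA-adapted linear layer: the frozen base weight W is augmented
    with a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 32.0):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # W stays frozen
        # B starts at zero so training begins exactly at the base model.
        self.A = torch.nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only A and B receive gradients; the update has rank <= r.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Only `r * (d_in + d_out)` parameters are trained here instead of `d_in * d_out`.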
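One concrete way to apply LoRA to every layer rather than attention only is via HuggingFace PEFT's `target_modules`; the model id and module names (`q_proj`, `gate_proj`, ...) below assume a Llama-style architecture and will differ for other model families:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```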
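The ~2/3 FLOP ratio follows from a rough per-weight accounting of the three matmuls in a training step, sketched below (this ignores the small extra cost of the adapter itself):

```python
# Per token and per weight, each matmul costs ~2 FLOPs.
forward, grad_input, grad_weight = 2, 2, 2

fullft_flops = forward + grad_input + grad_weight  # W is trained: all three
lora_flops = forward + grad_input                  # W frozen: no weight gradient

print(lora_flops / fullft_flops)  # 0.666... -> the ~2/3 ratio
```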
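A quick numerical check of one such invariance: scaling B up and A down by the same factor leaves the update (α/r)·BA unchanged, so the two initialization scales are not independent knobs (the dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 8, 32.0
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))

c = 10.0  # any nonzero rescaling factor
update = (alpha / r) * B @ A
update_rescaled = (alpha / r) * (c * B) @ (A / c)
assert np.allclose(update, update_rescaled)
```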