LoRA and Weight Decay (2023)

LoRA is a low-rank adapter method for fine-tuning LLMs: it adds small trainable low-rank matrices on top of the frozen original weights, so far fewer parameters are trained than in full fine-tuning.
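
To make the parameter saving concrete, here is a minimal sketch that counts trainable parameters for a single weight matrix. The layer shape (4096 x 4096) and the rank (8) are illustrative assumptions, not figures from the article, and the function names are hypothetical.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix W is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W_init + B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed, not taken from the article).
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"ratio: {lora / full:.4%}")
```

The saving grows with the layer size, since the LoRA count scales with d_out + d_in rather than d_out * d_in.
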
With weight decay, LoRA regularizes the solution toward the original frozen weights (W → W_init), because the decay shrinks only the adapter matrices; full fine-tuning instead regularizes toward zero (W → 0).
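
The pull toward W_init versus the pull toward zero can be seen in a toy sketch that applies only the decay step, with the task gradient assumed to be zero so the regularization effect is isolated. The matrices, learning rate, and decay coefficient below are made-up illustrative values, and the helper names are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def scale(X, s):
    """Multiply every entry of X by the scalar s."""
    return [[s * x for x in row] for row in X]


def add(X, Y):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def frob(X):
    """Frobenius norm."""
    return sum(x * x for row in X for x in row) ** 0.5


lr, wd, steps = 0.1, 0.1, 2000          # assumed hyperparameters
W_init = [[1.0, -2.0], [0.5, 3.0]]      # stand-in for pretrained weights

# Full fine-tuning: the decay acts on W itself.
W_full = [row[:] for row in W_init]

# LoRA: W_init stays frozen, the decay acts on the factors B (2x1) and A (1x2).
# Nonzero starting values stand in for adapters that have already been trained.
B = [[0.3], [-0.1]]
A = [[0.2, 0.4]]

for _ in range(steps):
    W_full = scale(W_full, 1 - lr * wd)  # W <- (1 - lr*wd) W
    B = scale(B, 1 - lr * wd)            # B <- (1 - lr*wd) B
    A = scale(A, 1 - lr * wd)            # A <- (1 - lr*wd) A

W_lora = add(W_init, matmul(B, A))       # effective LoRA weight
dist_to_init = frob(add(W_lora, scale(W_init, -1.0)))

print("||W_full||          =", frob(W_full))    # shrinks toward 0
print("||W_lora - W_init|| =", dist_to_init)    # shrinks toward 0
print("||W_lora||          =", frob(W_lora))    # stays near ||W_init||
```

In this sketch the fully fine-tuned weight collapses toward zero, while the LoRA weight collapses toward W_init because only the product B @ A is being shrunk.
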
Even at high rank or with unlimited compute, LoRA with weight decay does not approximate full fine-tuning, because it implicitly optimizes a different regularized objective.
A corrected regularization term can be derived so that LoRA's adapted weights decay toward zero, matching full fine-tuning; it can be implemented in libraries such as Optax.
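
One way to write such a correction is sketched below. This uses assumed notation and is not necessarily the article's exact derivation: B is d x r, A is r x k, W_init is the frozen d x k weight, and λ is the decay coefficient. The default decay, read as an L2 penalty on the factors, is minimized at BA = 0 (that is, W = W_init); the corrected penalty acts on the adapted weight W_init + BA instead.

```latex
% Default LoRA decay as an L2 penalty on the factors (minimized at BA = 0)
R_{\text{default}}(A, B) = \frac{\lambda}{2}\left(\lVert A \rVert_F^2 + \lVert B \rVert_F^2\right)

% Corrected penalty on the adapted weight (W_init is a constant)
R_{\text{corrected}}(A, B) = \frac{\lambda}{2}\,\lVert W_{\text{init}} + BA \rVert_F^2

% Gradients with respect to the trainable factors, by the chain rule
\frac{\partial R_{\text{corrected}}}{\partial B} = \lambda\,(W_{\text{init}} + BA)\,A^\top,
\qquad
\frac{\partial R_{\text{corrected}}}{\partial A} = \lambda\,B^\top (W_{\text{init}} + BA)
```

These two gradients are what a custom regularization step would apply to B and A in place of the usual per-factor decay; how that step is wired into a particular optimizer library is left out of this sketch.
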
The default LoRA behavior can be seen as a feature with small datasets (it keeps the model close to the base model) or as a bug with large datasets (it limits adaptation).
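The equivalence between weight decay and L2 regularization holds only for optimizers without momentum, such as plain SGD; AdamW decouples weight decay from the gradient update so that the decay term does not pass through the optimizer's momentum and adaptive-scaling statistics.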
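
The difference between L2 regularization and decoupled weight decay can be checked on a single scalar parameter. The sketch below is a hand-written Adam step, not any library's implementation; the values of w, g, lr, and wd are arbitrary, and the decoupled decay is scaled by the learning rate, which is one common convention and an assumption here.

```python
import math


def adam_direction(g, m, v, t, b1=0.9, b2=0.999, eps=1e-8):
    """Return Adam's bias-corrected update direction and the new moments."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


w, g, lr, wd = 2.0, 0.5, 0.01, 0.1   # arbitrary illustrative values

# Plain SGD: adding wd*w to the gradient equals decaying the weight directly.
sgd_l2 = w - lr * (g + wd * w)
sgd_decoupled = w - lr * g - lr * wd * w

# Adam with L2: the penalty gradient wd*w is fed through Adam's normalization.
# On the first step the direction is g/|g|, so the penalty leaves the step
# size unchanged in this example.
direction, _, _ = adam_direction(g + wd * w, m=0.0, v=0.0, t=1)
adam_l2 = w - lr * direction

# AdamW-style: Adam sees only the task gradient; the decay is applied separately.
direction, _, _ = adam_direction(g, m=0.0, v=0.0, t=1)
adamw = w - lr * direction - lr * wd * w

print("SGD + L2        :", sgd_l2)
print("SGD + decoupled :", sgd_decoupled)
print("Adam + L2       :", adam_l2)
print("AdamW-style     :", adamw)
```

For SGD the two variants coincide, while for Adam the L2 penalty is absorbed by the gradient normalization and only the decoupled form actually shrinks the weight by lr * wd * w.
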
Empirical validation is needed to determine if LoRA's regularization towards base weights yields solutions as good as full fine-tuning in practice.
