LoRA and Weight Decay (2023)

2 days ago
  • #LoRA
  • #weight decay
  • #fine-tuning
  • LoRA is a low-rank adapter method for fine-tuning LLMs: it adds the product of two small trainable matrices to the frozen original weights, so far fewer parameters are trained than in full fine-tuning.
  • Weight decay applied to the LoRA matrices pulls the effective weights towards the frozen base weights (W → W_init), whereas weight decay in full fine-tuning pulls them towards zero (W → 0).
  • Even with unlimited resources or a high rank, LoRA with weight decay does not approximate full fine-tuning, because it optimizes a different implicit objective.
  • A corrected regularization term can be derived for LoRA that makes the adapted weights decay towards zero, matching full fine-tuning; it can be implemented in libraries such as Optax.
  • The default LoRA behavior can be seen as a feature with small datasets (it keeps the model close to the base model) or as a bug with large datasets (it limits adaptation).
  • The equivalence between weight decay and L2 regularization holds only for optimizers without momentum, such as plain SGD; AdamW decouples weight decay from the gradient update to avoid momentum-related issues.
  • Empirical validation is still needed to determine whether LoRA's regularization towards the base weights yields solutions as good as those of full fine-tuning in practice.
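
The difference between the two decay targets can be illustrated with a small JAX sketch. It is not the post's own derivation or code: the penalty on the effective weight is one plausible form of a "corrected" term, and the names `effective_weight`, `default_penalty`, `make_corrected_penalty`, and `run_decay` are hypothetical. The sketch runs plain gradient descent on each penalty alone, with no task loss and no momentum, so L2 regularization and weight decay coincide. The rank is set to the full dimension on purpose; at a lower rank the product B @ A could only cancel a rank-limited part of W_init.

```python
import jax
import jax.numpy as jnp


def effective_weight(w_init, a, b):
    # LoRA parameterization: W = W_init + B @ A, with W_init frozen.
    return w_init + b @ a


def default_penalty(a, b):
    # L2 penalty on the adapters only. Its minimum is A = 0, B = 0,
    # which corresponds to W = W_init.
    return 0.5 * (jnp.sum(a ** 2) + jnp.sum(b ** 2))


def make_corrected_penalty(w_init):
    # Assumed form of a corrected term: L2 penalty on the effective weight.
    # Its minimum is W = 0, the target of weight decay in full fine-tuning.
    def penalty(a, b):
        return 0.5 * jnp.sum(effective_weight(w_init, a, b) ** 2)

    return penalty


def run_decay(penalty, a, b, lr=0.05, steps=2000):
    # Plain gradient descent on the penalty alone (no task loss, no momentum).
    grad_fn = jax.jit(jax.grad(penalty, argnums=(0, 1)))
    for _ in range(steps):
        grad_a, grad_b = grad_fn(a, b)
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b


dim = rank = 4  # full rank on purpose, so B @ A can cancel all of W_init
w_init = jnp.eye(dim)  # toy stand-in for pretrained weights
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a0 = 0.1 * jax.random.normal(key_a, (rank, dim))
b0 = 0.1 * jax.random.normal(key_b, (dim, rank))

a_def, b_def = run_decay(default_penalty, a0, b0)
a_cor, b_cor = run_decay(make_corrected_penalty(w_init), a0, b0)

# Default decay: the adapters vanish, so W returns to W_init.
dist_to_init = jnp.linalg.norm(effective_weight(w_init, a_def, b_def) - w_init)
# Corrected decay: the effective weight itself shrinks towards zero.
norm_corrected = jnp.linalg.norm(effective_weight(w_init, a_cor, b_cor))

print("default:   ||W - W_init|| =", float(dist_to_init))
print("corrected: ||W||          =", float(norm_corrected))
```

In a real training run the penalty would be added to the task loss rather than minimized on its own, and with Adam-style optimizers the choice between an L2 term in the loss and decoupled weight decay would change the result.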