Hasty Briefs (beta)

Fine-tuning and deploying Gemma 4 is not that easy

8 hours ago
  • #Fine-tuning
  • #Gemma 4
  • #Deployment
  • Google's Gemma 4 model introduced custom ClippableLinear layers that PEFT rejected during LoRA fine-tuning because they inherit from nn.Module rather than nn.Linear; the workaround was either unwrapping the layers or scoping target_modules with a regex.
  • Training loss failed to converge because SFTTrainer forced use_cache=False, breaking Gemma 4's hybrid KV-sharing attention mechanism; this was fixed in transformers v5.5.2.
  • DeepSpeed ZeRO-3 silently corrupted adapter saves by writing empty tensors for sharded parameters, so DeepSpeed had to be disabled for LoRA fine-tuning on Gemma 4.
  • Deployment required merging LoRA adapters into base weights before serving, as vLLM and SGLang do not support runtime LoRA for Gemma 4 due to architectural constraints.
  • A reproducible notebook was provided for fine-tuning and deploying Gemma 4, including steps for dependency installation, training, and key remapping for vLLM compatibility.
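The regex-scoping workaround from the first bullet can be sketched without PEFT installed: when target_modules is given as a string, PEFT matches it as a regex against each module's dotted name, so a pattern anchored to the inner projection layers never touches the wrappers. The module names and the wrapper layout below are illustrative assumptions, not Gemma 4's actual parameter names.

```python
import re

# Hypothetical module names: standard attention projections plus a
# ClippableLinear wrapper that PEFT would reject (it subclasses
# nn.Module, not nn.Linear, so LoRA cannot inject into it).
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.gate_proj.clip_wrapper",  # assumed wrapper name
    "model.layers.1.self_attn.q_proj",
]

# PEFT applies re.fullmatch when target_modules is a string; scoping the
# pattern to the inner q/v projections skips the wrapper modules.
TARGET_PATTERN = r".*self_attn\.(q_proj|v_proj)$"

targeted = [n for n in module_names if re.fullmatch(TARGET_PATTERN, n)]
print(targeted)
```

The same pattern string would then be passed as `target_modules` in a `LoraConfig`; the list comprehension above only previews which modules it would select.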
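Since the summary pins the use_cache fix to transformers v5.5.2, a training script can refuse to start on older releases. A minimal sketch, parsing version strings by hand to avoid extra dependencies; the v5.5.2 threshold comes from the post, everything else is an assumption.

```python
def version_tuple(v: str) -> tuple:
    """Parse "5.5.2" into (5, 5, 2); stops at pre-release suffixes
    like "5.5.2.dev0" so they compare as the base version."""
    parts = []
    for p in v.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)

# Per the summary, the SFTTrainer use_cache / hybrid KV-sharing
# interaction was fixed in transformers v5.5.2.
MIN_FIXED = (5, 5, 2)

def transformers_has_fix(installed: str) -> bool:
    return version_tuple(installed) >= MIN_FIXED

print(transformers_has_fix("5.5.1"))  # older release, still broken
print(transformers_has_fix("5.5.2"))  # contains the fix
```

In practice the installed version would come from `transformers.__version__` before constructing the trainer.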
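The ZeRO-3 failure mode described above (empty tensors silently written for sharded parameters) is cheap to detect right after saving. A framework-free sketch: shapes stand in for real tensors, and the key names are hypothetical examples of a LoRA adapter state dict.

```python
def find_empty_tensors(shapes_by_name):
    """Return names of parameters saved with zero elements -- the
    symptom described for ZeRO-3 writing empty shards into the
    adapter file. Values are shape tuples instead of tensors so the
    sketch needs no ML framework."""
    bad = []
    for name, shape in shapes_by_name.items():
        numel = 1
        for dim in shape:
            numel *= dim
        if numel == 0:
            bad.append(name)
    return bad

# Hypothetical adapter save: one LoRA matrix came back empty.
adapter_shapes = {
    "base_model.layers.0.q_proj.lora_A.weight": (16, 4096),
    "base_model.layers.0.q_proj.lora_B.weight": (0,),  # corrupted shard
}
print(find_empty_tensors(adapter_shapes))
```

With a real checkpoint, the shape dict would be built from the loaded state dict (e.g. `{k: tuple(v.shape) for k, v in sd.items()}`), and any hit means the save is unusable.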
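The last two bullets (merging adapters, then remapping keys for vLLM) hinge on checkpoint key names: PEFT typically nests the wrapped model under a `base_model.model.` prefix, and in PEFT the merge itself is done with `PeftModel.merge_and_unload()`. The post does not say which exact remapping its notebook performs, so this is a hedged sketch of the common prefix-stripping step; inspect your own checkpoint's keys before relying on the prefix assumed here.

```python
def remap_merged_keys(keys):
    """Strip the usual PEFT wrapper prefix from checkpoint key names so
    they match what a vanilla serving engine expects. The prefix
    "base_model.model." is an assumption about the checkpoint layout,
    not something stated in the post."""
    prefix = "base_model.model."
    return [k[len(prefix):] if k.startswith(prefix) else k for k in keys]

# Hypothetical keys from a merged checkpoint.
merged_keys = [
    "base_model.model.model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]
print(remap_merged_keys(merged_keys))
```

Merging first also matches the serving constraint in the summary: since vLLM and SGLang cannot attach runtime LoRA for Gemma 4, the adapter weights must already be folded into the base checkpoint that the engine loads.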