Fine-tuning and deploying Gemma 4 is not that easy
- #Fine-tuning
- #Gemma 4
- #Deployment
- Google's Gemma 4 model introduced custom ClippableLinear layers that PEFT rejected during LoRA fine-tuning because they inherit from nn.Module rather than nn.Linear; the workaround was to unwrap the layers or to scope target_modules with a regex that matches only the plain linear projections.
- Training loss failed to converge because SFTTrainer forced use_cache=False, breaking Gemma 4's hybrid KV-sharing attention mechanism; this was fixed in transformers v5.5.2.
- DeepSpeed ZeRO-3 silently corrupted adapter saves by writing empty tensors for sharded parameters, requiring disabling DeepSpeed for LoRA fine-tuning on Gemma 4.
- Deployment required merging LoRA adapters into base weights before serving, as vLLM and SGLang do not support runtime LoRA for Gemma 4 due to architectural constraints.
- A reproducible notebook was provided for fine-tuning and deploying Gemma 4, including steps for dependency installation, training, and key remapping for vLLM compatibility.
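The regex-scoping workaround for the ClippableLinear issue can be sketched in pure Python. PEFT treats a string-valued `target_modules` as a regex over module names; the layer names below are assumptions for illustration, not confirmed Gemma 4 internals:

```python
import re

# Hypothetical module names as they might appear in a Gemma 4 checkpoint;
# the exact names are an assumption for illustration.
module_names = [
    "model.layers.0.self_attn.q_proj",      # plain nn.Linear
    "model.layers.0.self_attn.k_proj",      # plain nn.Linear
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.clip_proj.inner",   # wrapped ClippableLinear internals
]

# A target_modules-style regex that scopes LoRA to attention projections
# only, so anything under a clip_proj wrapper is never touched.
target_pattern = re.compile(r"self_attn\.(q|k|v|o)_proj$")

selected = [name for name in module_names if target_pattern.search(name)]
print(selected)
```

Passing such a pattern as the `target_modules` string keeps PEFT from ever inspecting the custom layers it would otherwise reject.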
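The merge-before-serving step folds each adapter into its base weight. The arithmetic is just `W' = W + (alpha / r) * B @ A`; a minimal pure-Python sketch of that fold (in practice PEFT's merge utilities do this over torch tensors):

```python
# Minimal LoRA merge arithmetic in pure Python, for illustration only:
# W' = W + (alpha / r) * B @ A, where A is (r x in) and B is (out x r).

def matmul(X, Y):
    """Naive matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA adapter into the base weight matrix."""
    scale = alpha / r
    BA = matmul(B, A)  # low-rank update, same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# 2x2 base weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r=1, in_features=2
B = [[3.0], [4.0]]      # out_features=2, r=1
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[7.0, 12.0], [8.0, 17.0]]
```

After the fold, the adapter matrices are discarded and the server loads a single dense checkpoint, which is why runtime LoRA support in vLLM or SGLang is not needed.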
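The key-remapping step mentioned for vLLM compatibility amounts to rewriting checkpoint key names. A hedged sketch of such a rewrite, using ordered regex rules; the specific prefixes and suffixes (PEFT-style `base_model.model.` and `.default`) are assumptions, not the notebook's exact rules:

```python
import re

def remap_keys(keys, rules):
    """Apply ordered regex rewrite rules to checkpoint key names."""
    remapped = []
    for key in keys:
        for pattern, repl in rules:
            key = re.sub(pattern, repl, key)
        remapped.append(key)
    return remapped

# Hypothetical rules: strip a PEFT-style "base_model.model." prefix and
# drop ".default" adapter suffixes. The exact keys vLLM expects for
# Gemma 4 are an assumption here.
rules = [
    (r"^base_model\.model\.", ""),
    (r"\.default(?=\.weight$)", ""),
]

keys = [
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
]
print(remap_keys(keys, rules))
```

Running the remap over the merged checkpoint's state dict before upload gives the serving engine key names it can load directly.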