Fine-tuning and deploying Gemma 4 is not that easy
- #Fine-tuning
- #Gemma 4
- #Deployment
- Google's Gemma 4 model introduced custom ClippableLinear layers that PEFT rejected during LoRA fine-tuning because they inherit from nn.Module rather than nn.Linear; the workaround was to unwrap the layers or to scope target_modules with a regex that matches only the plain linear projections.
- Training loss failed to converge because SFTTrainer forced use_cache=False, breaking Gemma 4's hybrid KV-sharing attention mechanism; this was fixed in transformers v5.5.2.
- DeepSpeed ZeRO-3 silently corrupted adapter saves by writing empty tensors for sharded parameters, requiring disabling DeepSpeed for LoRA fine-tuning on Gemma 4.
- Deployment required merging LoRA adapters into base weights before serving, as vLLM and SGLang do not support runtime LoRA for Gemma 4 due to architectural constraints.
- A reproducible notebook was provided for fine-tuning and deploying Gemma 4, including steps for dependency installation, training, and key remapping for vLLM compatibility.
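The regex-scoping workaround for the ClippableLinear issue can be sketched in pure Python. PEFT treats a string-valued `target_modules` as a regex over module names; the layer names below are assumptions for illustration, not confirmed Gemma 4 internals:

```python
import re

# Hypothetical module names as they might appear in a Gemma 4 checkpoint;
# the exact names are an assumption for illustration.
module_names = [
    "model.layers.0.self_attn.q_proj",      # plain nn.Linear
    "model.layers.0.self_attn.k_proj",      # plain nn.Linear
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.clip_proj.inner",   # wrapped ClippableLinear internals
]

# A target_modules-style regex that scopes LoRA to attention projections
# only, so anything under a clip_proj wrapper is never touched.
target_pattern = re.compile(r"self_attn\.(q|k|v|o)_proj$")

selected = [name for name in module_names if target_pattern.search(name)]
print(selected)
```

Passing such a pattern as the `target_modules` string keeps PEFT from ever inspecting the custom layers it would otherwise reject.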
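The merge-before-serving step folds each adapter into its base weight. The arithmetic is just `W' = W + (alpha / r) * B @ A`; a minimal pure-Python sketch of that fold (in practice PEFT's merge utilities do this over torch tensors):

```python
# Minimal LoRA merge arithmetic in pure Python, for illustration only:
# W' = W + (alpha / r) * B @ A, where A is (r x in) and B is (out x r).

def matmul(X, Y):
    """Naive matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA adapter into the base weight matrix."""
    scale = alpha / r
    BA = matmul(B, A)  # low-rank update, same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# 2x2 base weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r=1, in_features=2
B = [[3.0], [4.0]]      # out_features=2, r=1
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[7.0, 12.0], [8.0, 17.0]]
```

After the fold, the adapter matrices are discarded and the server loads a single dense checkpoint, which is why runtime LoRA support in vLLM or SGLang is not needed.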
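The key-remapping step mentioned for vLLM compatibility amounts to rewriting checkpoint key names. A hedged sketch of such a rewrite, using ordered regex rules; the specific prefixes and suffixes (PEFT-style `base_model.model.` and `.default`) are assumptions, not the notebook's exact rules:

```python
import re

def remap_keys(keys, rules):
    """Apply ordered regex rewrite rules to checkpoint key names."""
    remapped = []
    for key in keys:
        for pattern, repl in rules:
            key = re.sub(pattern, repl, key)
        remapped.append(key)
    return remapped

# Hypothetical rules: strip a PEFT-style "base_model.model." prefix and
# drop ".default" adapter suffixes. The exact keys vLLM expects for
# Gemma 4 are an assumption here.
rules = [
    (r"^base_model\.model\.", ""),
    (r"\.default(?=\.weight$)", ""),
]

keys = [
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
]
print(remap_keys(keys, rules))
```

Running the remap over the merged checkpoint's state dict before upload gives the serving engine key names it can load directly.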