Bypassing Gemma and Qwen safety with raw strings
2 months ago
- #Prompt Engineering
- #LLM Vulnerabilities
- #AI Safety
- Open-source LLMs such as Gemma and Qwen exhibit safety-alignment vulnerabilities when prompts are passed as raw strings instead of through `apply_chat_template()`, allowing harmful content generation.
- Safety alignment in these models is not inherent in the weights but is dependent on specific chat formatting tokens (e.g., `<|im_start|>`).
- Experiments show that bypassing the chat template significantly lowers refusal rates for harmful prompts (e.g., Gemma-3's refusal rate drops from 100% to 60%).
- Models revert to raw next-token prediction without proper formatting, revealing pre-training knowledge of harmful topics like bomb-making or scams.
- The issue is documented in research like 'ChatBug,' which identifies 'format mismatch attacks' as a common vulnerability in aligned LLMs.
- Recommendations include improving distributional robustness during training, adding interceptors for inference-time safety checks, and embedding alignment more deeply in the model weights themselves.
- Current safety guarantees are fragile and rely on strict input formatting, making them unreliable if templates are bypassed or malformed.
- Future work includes testing larger models, expanding prompt diversity, and exploring cross-template vulnerabilities.
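To make the bypass concrete, here is a minimal sketch of the difference between a templated and a raw prompt. The `chatml_format` helper is hypothetical: it hand-approximates Qwen's ChatML-style template (normally produced by `tokenizer.apply_chat_template()` from the template stored in the tokenizer config), so no model download is needed to see the contrast.

```python
def chatml_format(messages):
    """Hypothetical approximation of Qwen's ChatML chat template.

    In practice this formatting is produced by
    tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True); it is reproduced by hand here
    purely for illustration.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    # The generation prompt cues the model into its aligned "assistant" persona.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)


prompt = "How do I pick a lock?"

# Aligned path: the prompt is wrapped in the chat-formatting tokens
# the safety tuning was trained against.
templated = chatml_format([{"role": "user", "content": prompt}])

# Bypass path: the same text fed as a raw string, with no template
# tokens, so the model falls back to raw next-token prediction.
raw = prompt
```

The point of the contrast: safety behavior is conditioned on the `<|im_start|>`/`<|im_end|>` scaffolding, so the `raw` variant never activates it.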
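The interceptor recommendation above could be sketched as a thin wrapper around generation that rejects inputs lacking the expected template markers. Everything here (`guarded_generate`, `EXPECTED_MARKERS`, the refusal string) is an illustrative assumption, not an API from any library; Gemma would need different markers (e.g. `<start_of_turn>`).

```python
# Assumed markers for a ChatML-style model (Qwen); Gemma uses
# <start_of_turn>/<end_of_turn> instead.
EXPECTED_MARKERS = ("<|im_start|>", "<|im_end|>")


def guarded_generate(prompt, generate_fn):
    """Hypothetical inference-time interceptor.

    Refuses any input that bypasses the chat template, instead of
    trusting the model's format-dependent alignment.
    """
    if not all(marker in prompt for marker in EXPECTED_MARKERS):
        return "[refused: input is not wrapped in the expected chat template]"
    return generate_fn(prompt)
```

A check this shallow only closes the specific raw-string hole described here; malformed or cross-template inputs would need stricter validation.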