Bypassing Gemma and Qwen safety with raw strings
2 months ago
- #Prompt Engineering
- #LLM Vulnerabilities
- #AI Safety
- Open-source LLMs such as Gemma and Qwen exhibit safety-alignment vulnerabilities when prompts are passed as raw strings instead of through `apply_chat_template()`, allowing harmful content generation.
- Safety alignment in these models is not inherent in the weights but is dependent on specific chat formatting tokens (e.g., `<|im_start|>`).
- Experiments show that bypassing the chat template significantly lowers refusal rates for harmful prompts (e.g., Gemma-3's refusal rate drops from 100% to 60%).
- Models revert to raw next-token prediction without proper formatting, revealing pre-training knowledge of harmful topics like bomb-making or scams.
- The issue is documented in research like 'ChatBug,' which identifies 'format mismatch attacks' as a common vulnerability in aligned LLMs.
- Recommendations include improving distributional robustness during training, adding interceptors for inference-time safety checks, and embedding alignment more deeply in the model weights themselves.
- Current safety guarantees are fragile and rely on strict input formatting, making them unreliable if templates are bypassed or malformed.
- Future work includes testing larger models, expanding prompt diversity, and exploring cross-template vulnerabilities.
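To make the bypass concrete, here is a minimal sketch of the difference between a templated and a raw prompt. The `chatml_format` helper is hypothetical: it hand-approximates Qwen's ChatML-style template (normally produced by `tokenizer.apply_chat_template()` from the template stored in the tokenizer config), so no model download is needed to see the contrast.

```python
def chatml_format(messages):
    """Hypothetical approximation of Qwen's ChatML chat template.

    In practice this formatting is produced by
    tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True); it is reproduced by hand here
    purely for illustration.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    # The generation prompt cues the model into its aligned "assistant" persona.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)


prompt = "How do I pick a lock?"

# Aligned path: the prompt is wrapped in the chat-formatting tokens
# the safety tuning was trained against.
templated = chatml_format([{"role": "user", "content": prompt}])

# Bypass path: the same text fed as a raw string, with no template
# tokens, so the model falls back to raw next-token prediction.
raw = prompt
```

The point of the contrast: safety behavior is conditioned on the `<|im_start|>`/`<|im_end|>` scaffolding, so the `raw` variant never activates it.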
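The interceptor recommendation above could be sketched as a thin wrapper around generation that rejects inputs lacking the expected template markers. Everything here (`guarded_generate`, `EXPECTED_MARKERS`, the refusal string) is an illustrative assumption, not an API from any library; Gemma would need different markers (e.g. `<start_of_turn>`).

```python
# Assumed markers for a ChatML-style model (Qwen); Gemma uses
# <start_of_turn>/<end_of_turn> instead.
EXPECTED_MARKERS = ("<|im_start|>", "<|im_end|>")


def guarded_generate(prompt, generate_fn):
    """Hypothetical inference-time interceptor.

    Refuses any input that bypasses the chat template, instead of
    trusting the model's format-dependent alignment.
    """
    if not all(marker in prompt for marker in EXPECTED_MARKERS):
        return "[refused: input is not wrapped in the expected chat template]"
    return generate_fn(prompt)
```

A check this shallow only closes the specific raw-string hole described here; malformed or cross-template inputs would need stricter validation.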