Hasty Briefs


Bypassing Gemma and Qwen safety with raw strings

2 months ago
  • #Prompt Engineering
  • #LLM Vulnerabilities
  • #AI Safety
  • Open-source LLMs like Gemma and Qwen have safety alignment vulnerabilities when the `apply_chat_template()` function is omitted, allowing harmful content generation.
  • Safety alignment in these models is not inherent in the weights but is dependent on specific chat formatting tokens (e.g., `<|im_start|>`).
  • Experiments show that bypassing the chat template leads to significant drops in refusal rates for harmful prompts (e.g., Gemma-3's refusal rate drops from 100% to 60%).
  • Models revert to raw next-token prediction without proper formatting, revealing pre-training knowledge of harmful topics like bomb-making or scams.
  • The issue is documented in research like 'ChatBug,' which identifies 'format mismatch attacks' as a common vulnerability in aligned LLMs.
  • Recommendations include improving distributional robustness during training, using interceptors for inference-time safety checks, and deeper alignment in model weights.
  • Current safety guarantees are fragile and rely on strict input formatting, making them unreliable if templates are bypassed or malformed.
  • Future work includes testing larger models, expanding prompt diversity, and exploring cross-template vulnerabilities.
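The gap the summary describes is easiest to see by comparing what `apply_chat_template()` produces with the raw string that bypasses it. The sketch below approximates the ChatML-style wrapper used by Qwen-family models (the `<|im_start|>`/`<|im_end|>` tokens mentioned above); it is an illustration of the format, not the actual tokenizer implementation, and the exact template varies per model.

```python
# Sketch: approximate the ChatML-style wrapper that a Qwen tokenizer's
# apply_chat_template() emits. Sending the raw user string instead skips
# these control tokens, so the model falls back to plain next-token
# prediction over pre-training text.

def chatml_format(messages, add_generation_prompt=True):
    """Build a ChatML-style prompt string from a list of role/content dicts."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Cue the model to answer in the assistant role.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [{"role": "user", "content": "What is the capital of France?"}]

templated = chatml_format(messages)   # the safety-aligned input path
raw = messages[0]["content"]          # the bypass: no template tokens at all
```

The safety behavior the models were aligned on is conditioned on the templated form; the `raw` variant never triggers it.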
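One of the recommendations above is an inference-time interceptor. A minimal sketch, assuming ChatML-style markers (the marker list and function name here are illustrative, not from any particular serving stack): reject any prompt that reaches the model without the expected template tokens.

```python
# Sketch of an inference-time "interceptor" safety check: refuse to serve
# prompts that lack the chat-template control tokens, closing the
# raw-string bypass at the serving layer rather than in the weights.

REQUIRED_MARKERS = ("<|im_start|>", "<|im_end|>")  # assumed ChatML-style tokens

def enforce_template(prompt: str) -> str:
    """Pass the prompt through only if it carries the template markers."""
    if not all(marker in prompt for marker in REQUIRED_MARKERS):
        raise ValueError("prompt bypasses the chat template; refusing to serve")
    return prompt
```

This is a stopgap, consistent with the article's point that such checks complement, rather than replace, deeper alignment in the weights.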