Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in LLMs
- #LLM Security
- #Adversarial Attacks
- #AI Alignment
- Adversarial poetry acts as a universal single-turn jailbreak technique for large language models (LLMs).
- Across 25 proprietary and open-weight models, poetic prompts achieved high attack-success rates (ASR), with some exceeding 90%.
- Poetic attacks transfer across multiple risk domains including CBRN, manipulation, cyber-offense, and loss-of-control.
- Converting harmful prompts into verse via a meta-prompt increased ASRs by up to 18-fold relative to prose baselines.
- Outputs were evaluated using open-weight judge models and human-validated subsets, with disagreements manually resolved.
- Hand-crafted poems achieved an average jailbreak success rate of 62%, versus 43% for meta-prompt conversions.
- The findings highlight a systematic vulnerability in current alignment methods and evaluation protocols.
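The evaluation protocol described above (open-weight judge models scoring outputs, with human-validated subsets resolving disagreements) can be sketched in miniature. This is a hypothetical helper, not the paper's actual pipeline: `attack_success_rate` and its label format are assumptions introduced here for illustration.

```python
def attack_success_rate(judge_labels, human_labels=None):
    """Compute ASR from automated judge verdicts, letting human
    verdicts override the judge wherever both labels exist.

    judge_labels: dict mapping prompt_id -> bool (True = jailbreak succeeded).
    human_labels: optional dict of human verdicts for a validated subset.
    """
    resolved = dict(judge_labels)
    if human_labels:
        resolved.update(human_labels)  # human verdict wins on overlap
    if not resolved:
        return 0.0
    return sum(resolved.values()) / len(resolved)

# Toy example: the judge flags p1 and p3 as successful attacks;
# a human reviewer overturns the judge's verdict on p2.
judge = {"p1": True, "p2": False, "p3": True, "p4": False}
human = {"p2": True}
print(attack_success_rate(judge, human))  # 3 of 4 -> 0.75
```

The human-override rule mirrors the summary's note that disagreements were manually resolved; a real pipeline would also track inter-rater agreement and per-domain ASR breakdowns.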