Hasty Briefs

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in LLMs

  • #LLM Security
  • #Adversarial Attacks
  • #AI Alignment
  • Adversarial poetry acts as a universal single-turn jailbreak technique for large language models (LLMs).
  • Across 25 proprietary and open-weight models, poetic prompts achieved high attack-success rates (ASRs), with some models exceeding 90%.
  • Poetic attacks transfer across multiple risk domains including CBRN, manipulation, cyber-offense, and loss-of-control.
  • Converting harmful prompts into verse via a meta-prompt increased ASRs by up to 18× relative to prose baselines.
  • Outputs were evaluated using open-weight judge models and human-validated subsets, with disagreements manually resolved.
  • Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and 43% for meta-prompt conversions.
  • The findings highlight a systematic vulnerability in current alignment methods and evaluation protocols.
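The evaluation pipeline described above (open-weight judge models, with human labels resolving disagreements) boils down to a simple aggregation when computing ASR. A minimal sketch of that aggregation is below; the function and label names are illustrative assumptions, not the paper's actual protocol or code.

```python
from collections import Counter

def attack_success_rate(judge_labels, human_labels=None):
    """Compute ASR: the fraction of prompts judged 'unsafe' (attack succeeded).

    judge_labels: dict mapping prompt_id -> list of judge verdicts
                  ('unsafe' or 'safe'), one per judge model.
    human_labels: dict mapping prompt_id -> human verdict, used to
                  resolve cases where the judges disagree.
    """
    human_labels = human_labels or {}
    successes = 0
    for pid, verdicts in judge_labels.items():
        if len(set(verdicts)) > 1 and pid in human_labels:
            # Judges disagree: defer to the human-validated label.
            verdict = human_labels[pid]
        else:
            # Unanimous judges (or no human label): take the majority vote.
            verdict = Counter(verdicts).most_common(1)[0][0]
        successes += (verdict == "unsafe")
    return successes / len(judge_labels)

# Example: 3 prompts, 2 judges each; one disagreement resolved by a human.
judges = {
    "p1": ["unsafe", "unsafe"],
    "p2": ["safe", "unsafe"],
    "p3": ["safe", "safe"],
}
humans = {"p2": "unsafe"}
print(attack_success_rate(judges, humans))  # → 0.666... (2 of 3 succeeded)
```

Under this scheme the reported per-condition ASRs (e.g. 62% for hand-crafted poems, 43% for meta-prompt conversions) would be averages of such per-model fractions.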