Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in LLMs
- #LLM Security
- #Adversarial Attacks
- #AI Alignment
- Adversarial poetry acts as a universal single-turn jailbreak technique for large language models (LLMs).
- Across 25 proprietary and open-weight models, poetic prompts achieved high attack-success rates (ASR), with some exceeding 90%.
- Poetic attacks transfer across multiple risk domains including CBRN, manipulation, cyber-offense, and loss-of-control.
- Converting harmful prompts into verse via a meta-prompt increased ASRs by up to 18-fold relative to prose baselines.
- Outputs were evaluated using open-weight judge models and human-validated subsets, with disagreements manually resolved.
- Hand-crafted poems achieved an average jailbreak success rate of 62%, versus 43% for meta-prompt conversions.
- The findings highlight a systematic vulnerability in current alignment methods and evaluation protocols.
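The evaluation protocol described above (open-weight judge models scoring outputs, with human-validated subsets resolving disagreements) can be sketched in miniature. This is a hypothetical helper, not the paper's actual pipeline: `attack_success_rate` and its label format are assumptions introduced here for illustration.

```python
def attack_success_rate(judge_labels, human_labels=None):
    """Compute ASR from automated judge verdicts, letting human
    verdicts override the judge wherever both labels exist.

    judge_labels: dict mapping prompt_id -> bool (True = jailbreak succeeded).
    human_labels: optional dict of human verdicts for a validated subset.
    """
    resolved = dict(judge_labels)
    if human_labels:
        resolved.update(human_labels)  # human verdict wins on overlap
    if not resolved:
        return 0.0
    return sum(resolved.values()) / len(resolved)

# Toy example: the judge flags p1 and p3 as successful attacks;
# a human reviewer overturns the judge's verdict on p2.
judge = {"p1": True, "p2": False, "p3": True, "p4": False}
human = {"p2": True}
print(attack_success_rate(judge, human))  # 3 of 4 -> 0.75
```

The human-override rule mirrors the summary's note that disagreements were manually resolved; a real pipeline would also track inter-rater agreement and per-domain ASR breakdowns.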