Hasty Briefs

  • #jailbreak techniques
  • #AI safety
  • #poetry
  • Researchers found that AI chatbots like ChatGPT can be tricked into answering dangerous questions if prompts are phrased as poems.
  • The study reports a 62% success rate at bypassing AI safety measures with hand-crafted poems, and 43% with automated meta-prompt conversions (the success-rate metric is sketched in the first code block after this list).
  • Poetic prompts confuse AI guardrails by using metaphors, fragmented syntax, and oblique references, achieving up to 90% success on some models.
  • The researchers did not share examples of the jailbreaking poetry, deeming them too dangerous for public release.
  • AI guardrails, such as safety classifiers, often fail to recognize dangerous content when it is disguised in poetic form, because they key on stylistic surface features rather than underlying intent (see the second sketch after this list for a toy illustration of this brittleness).
  • The study suggests that poetic transformations let prompts avoid triggering safety mechanisms by steering the request through regions of the model's internal representation space that its safety training does not cover.
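
None of the code below comes from the study, which withheld its prompts. As a minimal sketch of how an attack-success-rate figure like the 62% and 43% above is typically computed, here the refusal check and the model call are hypothetical stand-ins, not the paper's actual pipeline:

```python
# Sketch of an attack-success-rate (ASR) calculation.
# is_refusal and query_model are invented stand-ins for illustration.

def is_refusal(response: str) -> bool:
    """Crude stand-in for the study's compliance judge."""
    markers = ("i can't", "i cannot", "i won't", "unable to help")
    return any(m in response.lower() for m in markers)

def attack_success_rate(prompts, query_model) -> float:
    """Fraction of prompts that elicit a non-refusal response."""
    successes = sum(1 for p in prompts if not is_refusal(query_model(p)))
    return successes / len(prompts)

# Example with a dummy model that refuses one of two prompts:
responses = iter(["I can't help with that.", "Sure, here is..."])
print(attack_success_rate(["a", "b"], lambda p: next(responses)))  # 0.5
```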
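
Likewise, a toy illustration of the guardrail brittleness described above: a surface-level keyword filter catches a literal phrasing but misses a metaphorical paraphrase of the same request. The blocklist and prompts are invented for illustration and bear no relation to the study's withheld material.

```python
# Toy keyword guardrail, NOT a real safety system: it matches literal
# phrases, so any stylistic rewording slips past it.

BLOCKLIST = {"pick a lock"}  # hypothetical filtered phrase

def keyword_guardrail(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    return any(term in prompt.lower() for term in BLOCKLIST)

print(keyword_guardrail("How do I pick a lock?"))            # True: literal match
print(keyword_guardrail("Teach the tumblers to sing open"))  # False: metaphor evades the filter
```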