A Theory of Why Prompt Injection Works

4 hours ago

Prompt injections exploit a flaw in how LLMs perceive roles, where style overrides role tags, leading to role confusion.
LLMs receive input as a continuous token stream; role tags (system, user, think, assistant, tool) are meant to impose structure but are internally insecure.
Role probes reveal that LLMs identify roles based on writing style rather than tags, allowing attackers to spoof roles (e.g., CoT Forgery mimics reasoning style).
Prompt injection success correlates with how much the LLM internally perceives injected text as belonging to a privileged role (e.g., user or think).
Roles isolate competing objectives (e.g., think vs. assistant for exploration vs. communication), but confusion between them enables attacks and subtle steering.
Subconscious steering through role boundaries could allow legal, large-scale manipulation of LLM states (e.g., in e-commerce), with limited current research.
Future roles research could explore new roles for objective conflicts, dynamic roles, and roles as a window into LLM cognition and representation.

Hasty Briefsbeta