A Theory of Why Prompt Injection Works
4 hours ago
- #LLM roles
- #AI security
- #prompt injection
- Prompt injections exploit a flaw in how LLMs perceive roles, where style overrides role tags, leading to role confusion.
- LLMs receive input as a continuous token stream; role tags (system, user, think, assistant, tool) are meant to impose structure but are internally insecure.
- Role probes reveal that LLMs identify roles based on writing style rather than tags, allowing attackers to spoof roles (e.g., CoT Forgery mimics reasoning style).
- Prompt injection success correlates with how much the LLM internally perceives injected text as belonging to a privileged role (e.g., user or think).
- Roles isolate competing objectives (e.g., think vs. assistant for exploration vs. communication), but confusion between them enables attacks and subtle steering.
- Subconscious steering through role boundaries could allow legal, large-scale manipulation of LLM states (e.g., in e-commerce), with limited current research.
- Future roles research could explore new roles for objective conflicts, dynamic roles, and roles as a window into LLM cognition and representation.