Hasty Briefsbeta

Bilingual

A Theory of Why Prompt Injection Works

5 hours ago
  • #LLM roles
  • #AI security
  • #prompt injection
  • Prompt injections exploit a flaw in how LLMs perceive roles, where style overrides role tags, leading to role confusion.
  • LLMs receive input as a continuous token stream; role tags (system, user, think, assistant, tool) are meant to impose structure but are internally insecure.
  • Role probes reveal that LLMs identify roles based on writing style rather than tags, allowing attackers to spoof roles (e.g., CoT Forgery mimics reasoning style).
  • Prompt injection success correlates with how much the LLM internally perceives injected text as belonging to a privileged role (e.g., user or think).
  • Roles isolate competing objectives (e.g., think vs. assistant for exploration vs. communication), but confusion between them enables attacks and subtle steering.
  • Subconscious steering through role boundaries could allow legal, large-scale manipulation of LLM states (e.g., in e-commerce), with limited current research.
  • Future roles research could explore new roles for objective conflicts, dynamic roles, and roles as a window into LLM cognition and representation.