New Prompt Injection Papers: Agents Rule of Two and the Attacker Moves Second
6 months ago
- #LLM Security
- #AI Agents
- #Prompt Injection
- Two new papers on LLM security and prompt injection were discussed.
- Agents Rule of Two: A Practical Approach to AI Agent Security proposes a 'Rule of Two' inspired by the lethal trifecta and Google Chrome's Rule of 2.
- The rule states that agents must satisfy no more than two of three properties to avoid high-impact consequences of prompt injection.
- The three properties are: processing untrustworthy inputs, access to sensitive systems/data, and changing state or communicating externally.
- The lethal trifecta model is limited to data exfiltration risks, while the Rule of Two includes changing state, covering more risks.
- The Attacker Moves Second paper evaluates 12 defenses against prompt injection and jailbreaking using adaptive attacks.
- Adaptive attacks, including gradient-based, reinforcement learning, and search-based methods, defeated most defenses with over 90% success rate.
- Human red-teaming achieved a 100% success rate against all defenses.
- The paper emphasizes the importance of adaptive evaluations for defense development.
- The conclusion suggests that reliable defenses against prompt injection are not yet available, supporting the Agents Rule of Two as current best practice.