DeepMind's paper on p0wning Claws and what we learned
- #AI Security
- #Autonomous Agents
- #Adversarial Attacks
- DeepMind's 'AI Agent Traps' paper categorizes six ways autonomous AI agents can be compromised; in real-world tests, every agent evaluated fell to at least one attack.
- Content Injection hides instructions in HTML or metadata that are invisible to humans but readable by models; such injections succeeded in up to 86% of tests.
- Semantic Manipulation exploits cognitive biases like authority deference, using impersonation or emotional framing to bend the agent's reasoning.
- Cognitive State attacks poison knowledge bases, such as RAG systems, where adversarial content can lie dormant until triggered by specific queries.
- Behavioral Control attacks steer agents into harmful actions, such as leaking data, underscoring the need for architectural defenses like egress firewalls.
- Systemic attacks propagate in multi-agent networks, with sub-agent hijacking success rates of 58–90%, highlighting risks of cognitive monoculture.
- Human-in-the-Loop attacks exploit operator automation bias: humans approve malicious agent outputs without noticing, a cognitive rather than a software problem.
- Key defenses include HTML sanitization, strict network routing, data provenance tracking, circuit breakers, and red-teaming against the full taxonomy.
- The gap between behavioral instructions (system prompts) and architectural defenses is critical, as prompts alone cover only about 10% of the attack surface.