Hasty Briefs (beta)

DeepMind's paper on p0wning Claws and what we learned

4 hours ago
  • #AI Security
  • #Autonomous Agents
  • #Adversarial Attacks
  • DeepMind's 'AI Agent Traps' paper categorizes six ways autonomous AI agents can be compromised; in its real-world tests, every agent evaluated fell to at least one attack.
  • Content Injection involves hidden instructions in HTML or metadata invisible to humans but readable by models, with up to 86% success in tests.
  • Semantic Manipulation exploits cognitive biases like authority deference, using impersonation or emotional framing to bend the agent's reasoning.
  • Cognitive State attacks poison knowledge bases, such as RAG systems, where adversarial content can lie dormant until triggered by specific queries.
  • Behavioral Control attacks steer agents into harmful actions, such as leaking data, underscoring the need for architectural defenses like egress firewalls.
  • Systemic attacks propagate in multi-agent networks, with sub-agent hijacking success rates of 58–90%, highlighting risks of cognitive monoculture.
  • Human-in-the-Loop attacks exploit operator automation bias: humans approve malicious agent outputs without noticing, making this a cognitive rather than a software problem.
  • Key defenses include HTML sanitization, strict network routing, data provenance tracking, circuit breakers, and red-teaming against the full taxonomy.
  • The gap between behavioral instructions (system prompts) and architectural defenses is critical, as prompts alone cover only about 10% of the attack surface.
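To make the Content Injection point concrete: the attack relies on text that a browser renders invisibly but a model still ingests. A minimal sketch of the HTML-sanitization defense (this is illustrative, not code from the paper) strips comments and hidden subtrees before any page text reaches the agent:

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Keep only text a human would plausibly see: drop HTML comments,
    <script>/<style>/<template> bodies, and elements hidden via inline
    style or the `hidden` attribute -- common hiding spots for injected
    agent instructions."""
    HIDDEN_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._stack = []       # True for each open tag that hides its subtree
        self._hidden_depth = 0  # >0 while inside any hidden subtree

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        style = (a.get("style") or "").replace(" ", "").lower()
        hidden = (tag in self.HIDDEN_TAGS
                  or "display:none" in style
                  or "visibility:hidden" in style
                  or "hidden" in a)
        self._stack.append(hidden)
        if hidden:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if self._stack and self._stack.pop():
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0 and data.strip():
            self.parts.append(data.strip())

    def handle_comment(self, data):
        pass  # comments never reach the model

def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

A real deployment would use a hardened sanitizer rather than this sketch, but the principle is the same: the model's input should match what the human reviewer sees.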
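The Cognitive State bullet pairs with the data-provenance defense: if every chunk entering a RAG knowledge base carries a trust label set at ingestion time, poisoned content can be excluded at query time even if it lay dormant for months. A toy sketch (the field names and substring matching are illustrative assumptions, not the paper's design):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str   # where the chunk was ingested from
    trusted: bool  # assigned by the ingestion pipeline, never by the content itself

def retrieve(chunks, query, allow_untrusted=False):
    """Toy retrieval: substring match stands in for vector similarity.
    The point is that the provenance filter runs regardless of how
    relevant the poisoned chunk looks to the query."""
    hits = [c for c in chunks if query.lower() in c.text.lower()]
    if not allow_untrusted:
        hits = [c for c in hits if c.trusted]
    return hits
```

With this in place, a scraped chunk that happens to match a trigger query still never reaches the agent's context unless an operator explicitly opts into untrusted sources.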