What happened after 2k people tried to hack my AI assistant
6 hours ago
- #AI Security
- #Email Challenge
- #Prompt Injection
- Built hackmyclaw.com, an email-based challenge where users tried to trick AI assistant Fiu into leaking a secrets.env file.
- Fiu received over 6,000 emails from 2,000+ people, including sophisticated attacks like impersonation and multi-language social engineering, but the secret never leaked.
- The project was powered by a simple security prompt on a VPS, with rules against revealing secrets, file modifications, or executing commands from emails.
- Experiment faced challenges: Gmail suspension due to high email volume, over $500 in API costs, and batch processing affecting agent's suspicion levels.
- Key findings: Prompt injection was harder than expected; robust instruction-following with a powerful model like Opus 4.6 provided strong resistance.
- Despite the success, concerns remain about AI assistants' security due to their access to sensitive data and the potential risks of repeated interactions.
- Unexpected outcome: Sponsors reached out to support the project, covering increased prizes and API costs.
- Future considerations: Testing with infinite credits for ongoing conversations and weaker models to explore vulnerabilities, especially in non-English languages.
- Overall, the experiment increased optimism about AI security but highlighted that prompt injection remains a real threat.