What happened after 2k people tried to hack my AI assistant

6 hours ago

Built hackmyclaw.com, an email-based challenge where users tried to trick AI assistant Fiu into leaking a secrets.env file.
Fiu received over 6,000 emails from 2,000+ people, including sophisticated attacks like impersonation and multi-language social engineering, but the secret never leaked.
The project was powered by a simple security prompt on a VPS, with rules against revealing secrets, file modifications, or executing commands from emails.
Experiment faced challenges: Gmail suspension due to high email volume, over $500 in API costs, and batch processing affecting agent's suspicion levels.
Key findings: Prompt injection was harder than expected; robust instruction-following with a powerful model like Opus 4.6 provided strong resistance.
Despite the success, concerns remain about AI assistants' security due to their access to sensitive data and the potential risks of repeated interactions.
Unexpected outcome: Sponsors reached out to support the project, covering increased prizes and API costs.
Future considerations: Testing with infinite credits for ongoing conversations and weaker models to explore vulnerabilities, especially in non-English languages.
Overall, the experiment increased optimism about AI security but highlighted that prompt injection remains a real threat.

Hasty Briefsbeta