Finding vulnerabilities in Python web apps using Claude Code and OpenAI Codex
8 days ago
- #Python Web Applications
- #AI Security
- #Vulnerability Detection
- Two AI coding agents, Claude Code and OpenAI Codex, were tested on their ability to find vulnerabilities in 11 large Python web applications.
- Claude Code found 46 vulnerabilities (14% true positive rate, 86% false positive rate).
- OpenAI Codex found 21 vulnerabilities (18% true positive rate, 82% false positive rate).
- Claude Code performed best on IDOR (insecure direct object reference) bugs, with a 22% true positive rate, but struggled with XSS (16% true positive rate) and especially SQL injection (5% true positive rate).
- OpenAI Codex had a 0% true positive rate on IDOR, SQL injection, and XSS, but did better on path traversal (47% true positive rate).
- Non-determinism was observed: identical runs on the same codebase yielded different results.
- The study highlighted the high false positive rates and the challenges of using AI for vulnerability detection in real-world applications.
- The research emphasized the need for better benchmarks and scaffolding to improve AI-based vulnerability detection.
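
The per-class rates and the run-to-run variation summarized above can be computed from a list of triaged findings. The sketch below is illustrative only and is not the study's tooling. It assumes the rates are shares of each tool's reported findings, which is what true positive and false positive percentages summing to 100% suggest. The `Finding` schema, the function names, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Finding(NamedTuple):
    """One reported finding after manual triage (hypothetical schema)."""
    vuln_class: str   # e.g. "IDOR", "SQLi", "XSS"
    location: str     # e.g. "app/views.py:42"
    confirmed: bool   # True if a reviewer judged it a real vulnerability


def rates_by_class(findings: Iterable[Finding]) -> dict:
    """Map each class to (true_positive_rate, false_positive_rate),
    both measured as shares of the findings reported for that class."""
    total = Counter()
    confirmed = Counter()
    for f in findings:
        total[f.vuln_class] += 1
        if f.confirmed:
            confirmed[f.vuln_class] += 1
    return {c: (confirmed[c] / n, 1 - confirmed[c] / n) for c, n in total.items()}


def run_overlap(run_a: Iterable[Finding], run_b: Iterable[Finding]) -> float:
    """Jaccard similarity of two runs, keyed by (class, location).
    1.0 means both runs reported the same findings; 0.0 means none in common."""
    a = {(f.vuln_class, f.location) for f in run_a}
    b = {(f.vuln_class, f.location) for f in run_b}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up sample data for two runs of the same agent on the same codebase.
run_a = [
    Finding("IDOR", "a.py:1", True),
    Finding("IDOR", "b.py:2", False),
    Finding("XSS", "c.py:3", False),
]
run_b = [
    Finding("IDOR", "a.py:1", True),
    Finding("SQLi", "f.py:9", False),
]

idor_rates = rates_by_class(run_a)["IDOR"]   # (0.5, 0.5)
overlap = run_overlap(run_a, run_b)          # 0.25
```

Keying the overlap on class and location is one possible choice. Matching findings across runs in practice may need fuzzier comparison, since an agent can describe the same bug at slightly different lines.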