Finding vulnerabilities in Python web apps using Claude Code and OpenAI Codex

8 days ago

Two AI coding agents, Claude Code and OpenAI Codex, were tested on their ability to find vulnerabilities in 11 large Python web applications.
Claude Code found 46 vulnerabilities (14% true positive rate, 86% false positive rate).
OpenAI Codex found 21 vulnerabilities (18% true positive rate, 82% false positive rate).
Claude Code performed best at finding IDOR bugs (22% true positive rate) but struggled with SQL Injection (5% true positive rate) and XSS (16% true positive rate).
OpenAI Codex performed poorly on IDOR (0% true positive rate), SQL Injection (0% true positive rate), and XSS (0% true positive rate) but did better on Path Traversal (47% true positive rate).
Non-determinism was observed: identical runs on the same codebase yielded different results.
The study highlighted the high false positive rates and the challenges of using AI for vulnerability detection in real-world applications.
The research emphasized the need for better benchmarks and scaffolding to improve AI-based vulnerability detection.
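
The paired percentages above sum to 100% for each tool, which suggests that both rates are shares of the findings each agent reported (confirmed versus not confirmed after triage). The following is a minimal sketch of that arithmetic, together with one simple way to quantify run-to-run non-determinism as the overlap between two sets of findings. The `Finding` schema, the `(file, vuln_class)` matching key, and all sample data are assumptions made up for illustration; they are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One agent-reported finding after manual triage (hypothetical schema)."""
    file: str
    vuln_class: str  # e.g. "IDOR", "SQLi", "XSS"
    is_real: bool    # triage verdict: True means a confirmed vulnerability


def rates(findings):
    """Return (true_positive_rate, false_positive_rate) as shares of reported findings."""
    if not findings:
        return 0.0, 0.0
    real = sum(f.is_real for f in findings)
    tpr = real / len(findings)
    return tpr, 1.0 - tpr


def run_overlap(run_a, run_b):
    """Jaccard overlap of two runs, matching findings on (file, vuln_class)."""
    a = {(f.file, f.vuln_class) for f in run_a}
    b = {(f.file, f.vuln_class) for f in run_b}
    if not a and not b:
        return 1.0  # two empty runs agree trivially
    return len(a & b) / len(a | b)


# Toy data, invented for illustration only (not results from the study).
run_1 = [
    Finding("views/orders.py", "IDOR", True),
    Finding("views/search.py", "SQLi", False),
    Finding("views/profile.py", "XSS", False),
    Finding("views/files.py", "Path Traversal", False),
]
run_2 = [
    Finding("views/orders.py", "IDOR", True),
    Finding("views/export.py", "SQLi", False),
]

tpr, fpr = rates(run_1)
print(f"run 1: TPR={tpr:.0%} FPR={fpr:.0%}")
print(f"overlap between runs: {run_overlap(run_1, run_2):.0%}")
```

An overlap well below 100% between identical runs on the same codebase is the kind of instability the summary describes. Matching on file and vulnerability class is a coarse choice; a real comparison would likely need a finer key, such as the affected function or line range.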
