CVE-Bench: testing LLM agents on real-world vulnerability patches
4 hours ago
- #CVE Fixing Evaluation
- #AI Security Benchmarking
- #LLM Agent Performance
- CVE-Bench is a benchmark for evaluating LLM agents on fixing real-world security vulnerabilities in Python projects.
- It includes 20 CVEs from 2025-2026, spanning 15 CWE categories and 18 projects, including Pillow, GitPython, and urllib3.
- Agents operate in sandboxed containers with tools for reading, editing, and testing code. They have no shell access, which prevents cheating.
- Three prompt conditions test different capabilities: 'Advisory' (full advisory), 'Diagnose' (behavioral description only), and 'Locate' (file/function only).
- Five models were tested: three from OpenAI (gpt-5.4-mini, gpt-5.4-nano, gpt-5.5) and two from Poolside (laguna-m.1, laguna-xs.2).
- No model reliably fixes vulnerabilities: the best overall solve rate is 50%. gpt-5.5 scores highest, but not significantly higher than the smaller OpenAI models.
- The OpenAI models are statistically indistinguishable from one another but outperform the Poolside models, which are likewise indistinguishable from each other.
- Cost efficiency varies widely: gpt-5.4-mini is the most efficient (100k input tokens), while the Laguna models use 4× more tokens for similar outcomes.
- Failure modes include wrong-search drift, budget exhaustion mid-implementation, partial fixes, and fixing the wrong part of a vulnerability.
- The 'Locate' condition is the sharpest test of genuine security reasoning, but models still struggle, indicating gaps in independent vulnerability recognition.
- Limitations include potential dataset contamination, narrow task scope (Python-only, compact fixes), and underpowered within-family comparisons.
- Building the benchmark revealed challenges like maintainers patching without tests, complex environment setups, and high inference costs.
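
The "statistically indistinguishable" and "underpowered" points above can be made concrete with a small sketch. The summary does not say which test the authors used, so the exact McNemar test below is an assumption, and the per-CVE outcomes are invented for illustration (they are not the benchmark's results). With 20 paired tasks, only the CVEs where exactly one of the two models succeeds carry information, which is why a gap of a few solves rarely reaches significance.

```python
from math import comb


def mcnemar_exact(a_solved, b_solved):
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    Returns (tasks only A solved, tasks only B solved, p-value).
    """
    if len(a_solved) != len(b_solved):
        raise ValueError("both models must be scored on the same tasks")

    # Only discordant tasks (one model solved it, the other did not) matter.
    a_only = sum(1 for a, b in zip(a_solved, b_solved) if a and not b)
    b_only = sum(1 for a, b in zip(a_solved, b_solved) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0

    # Under the null hypothesis each discordant task is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical outcomes on 20 CVEs (illustrative only, not the real results).
model_a = [True] * 10 + [False] * 10                         # 10/20 solved
model_b = [True] * 7 + [False] * 3 + [True] + [False] * 9    # 8/20 solved

a_only, b_only, p_value = mcnemar_exact(model_a, model_b)
print(f"A-only: {a_only}, B-only: {b_only}, p = {p_value}")
```

In this made-up case, a 10/20 versus 8/20 split leaves only four discordant tasks and a p-value of 0.625, far from any conventional significance threshold. A benchmark of this size can separate models only when one wins nearly all of the discordant tasks.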