CVE-Bench: testing LLM agents on real-world vulnerability patches

4 hours ago

#CVE Fixing Evaluation
#AI Security Benchmarking
#LLM Agent Performance

CVE-Bench is a benchmark for evaluating LLM agents on fixing real-world security vulnerabilities in Python projects.
It includes 20 CVEs from 2025-2026, spanning 15 CWE categories and 18 projects, including Pillow, GitPython, and urllib3.
Agents operate in sandboxed containers with tools for reading, editing, and testing code. They have no shell access, which is meant to prevent cheating.
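A minimal sketch of how a tool layer without shell access might look. The source does not describe the actual harness, so the `Sandbox` class, the `dispatch` function, and the tool names are all hypothetical; the test-running tool is omitted for brevity.

```python
from pathlib import Path


class Sandbox:
    """Hypothetical file tools confined to one repository root."""

    def __init__(self, root):
        self.root = Path(root).resolve()

    def _resolve(self, rel_path):
        # Reject any path that escapes the sandbox root (e.g. "../secrets").
        path = (self.root / rel_path).resolve()
        if not path.is_relative_to(self.root):  # requires Python 3.9+
            raise PermissionError(f"path outside sandbox: {rel_path}")
        return path

    def read_file(self, rel_path):
        return self._resolve(rel_path).read_text()

    def edit_file(self, rel_path, old, new):
        path = self._resolve(rel_path)
        text = path.read_text()
        if old not in text:
            raise ValueError("text to replace not found")
        # Replace only the first occurrence so edits stay targeted.
        path.write_text(text.replace(old, new, 1))


# Assumed allowlist: there is deliberately no "shell" entry.
ALLOWED_TOOLS = {"read_file", "edit_file"}


def dispatch(sandbox, tool, **kwargs):
    """Run an agent-requested tool, refusing anything off the allowlist."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not available: {tool}")
    return getattr(sandbox, tool)(**kwargs)
```

An allowlist like this means an agent cannot, for example, fetch the upstream patch over the network, because no tool exposes arbitrary command execution.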
Three prompt conditions test different capabilities: 'Advisory' (full advisory), 'Diagnose' (behavioral description only), and 'Locate' (file/function only).
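The three conditions differ only in how much information the agent receives. A rough sketch of that idea, assuming a hypothetical `Task` record and prompt wording (neither comes from the benchmark itself):

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    ADVISORY = "advisory"  # full advisory text
    DIAGNOSE = "diagnose"  # behavioral description only
    LOCATE = "locate"      # file and function only


@dataclass(frozen=True)
class Task:
    # All field names are illustrative assumptions.
    cve_id: str
    advisory_text: str
    behavior_description: str
    file_path: str
    function_name: str


def build_prompt(task: Task, condition: Condition) -> str:
    """Return a prompt that reveals only what the condition allows."""
    if condition is Condition.ADVISORY:
        detail = task.advisory_text
    elif condition is Condition.DIAGNOSE:
        detail = task.behavior_description
    else:
        # LOCATE: point at the code, say nothing about the flaw itself.
        detail = f"Look at {task.function_name} in {task.file_path}."
    return f"Fix the security vulnerability in this repository.\n{detail}"
```

Under 'Locate' the agent is told where to look but not what is wrong, so it has to recognize the vulnerability on its own.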
Five models were tested: three from OpenAI (gpt-5.4-mini, gpt-5.4-nano, gpt-5.5) and two from Poolside (laguna-m.1, laguna-xs.2).
Results show that no model reliably fixes vulnerabilities: the best overall solve rate is 50%. gpt-5.5 scores highest, but its lead over the smaller OpenAI models is not statistically significant.
The OpenAI models are statistically indistinguishable from one another but outperform the Poolside models, which are likewise indistinguishable from each other.
Cost efficiency varies widely: gpt-5.4-mini is the most efficient (100k input tokens), while the Laguna models use 4× more tokens for similar outcomes.
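The summary does not say which statistical test was used. Because every model attempts the same tasks, one reasonable choice is an exact McNemar test on paired pass/fail outcomes; the sketch below assumes that choice, and the function name is illustrative.

```python
from math import comb


def mcnemar_exact(solved_a, solved_b):
    """Two-sided exact McNemar p-value for paired pass/fail results.

    solved_a[i] and solved_b[i] are the outcomes of two models on task i.
    Only tasks where the models disagree carry information.
    """
    if len(solved_a) != len(solved_b):
        raise ValueError("both models must be scored on the same tasks")
    only_a = sum(1 for a, b in zip(solved_a, solved_b) if a and not b)
    only_b = sum(1 for a, b in zip(solved_a, solved_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree
    k = min(only_a, only_b)
    # Under the null, disagreements split 50/50: Binomial(n, 0.5) tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With only a few dozen tasks per model, two models have to disagree on many tasks, and mostly in the same direction, before the p-value becomes small. This is one way to see why within-family comparisons on a benchmark of this size can be underpowered.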
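One way to make cost efficiency comparable across models is to divide total input tokens by the number of solved tasks. The helper below is a sketch under that assumption; the summary does not state how the figures were computed, and the numbers in the usage example are made up for illustration.

```python
def tokens_per_solve(tokens_per_task, solved):
    """Total input tokens spent divided by the number of solved tasks."""
    if len(tokens_per_task) != len(solved):
        raise ValueError("need one token count and one outcome per task")
    n_solved = sum(1 for s in solved if s)
    if n_solved == 0:
        return float("inf")  # tokens were spent but nothing was solved
    return sum(tokens_per_task) / n_solved


# Illustrative numbers only: same outcomes, 4x the tokens per task.
outcomes = [True, True, False, False]
lean = tokens_per_solve([100_000] * 4, outcomes)
heavy = tokens_per_solve([400_000] * 4, outcomes)
ratio = heavy / lean
```

When two models solve the same number of tasks, the ratio of their tokens-per-solve equals the ratio of their raw token usage, so a model that uses 4× more tokens for similar outcomes is 4× more expensive per fix.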
Failure modes include wrong-search drift, budget exhaustion mid-implementation, partial fixes, and fixing the wrong part of a vulnerability.
The 'Locate' condition is the sharpest test of genuine security reasoning, but models still struggle, indicating gaps in independent vulnerability recognition.
Limitations include potential dataset contamination, narrow task scope (Python-only, compact fixes), and underpowered within-family comparisons.
Building the benchmark revealed challenges like maintainers patching without tests, complex environment setups, and high inference costs.
