CVE-Bench: testing LLM agents on real-world vulnerability patches

4 hours ago
  • #CVE Fixing Evaluation
  • #AI Security Benchmarking
  • #LLM Agent Performance
  • CVE-Bench is a benchmark for evaluating LLM agents on fixing real-world security vulnerabilities in Python projects.
  • It includes 20 CVEs from 2025-2026, spanning 15 CWE categories and 18 projects, including Pillow, GitPython, and urllib3.
  • Agents operate in sandboxed containers with tools for reading, editing, and testing code; shell access is withheld to prevent cheating.
  • Three prompt conditions test different capabilities: 'Advisory' (full advisory), 'Diagnose' (behavioral description only), and 'Locate' (file/function only).
  • Five models were tested: OpenAI's gpt-5.4-mini, gpt-5.4-nano, gpt-5.5, and Poolside's laguna-m.1, laguna-xs.2.
  • Results show that no model reliably fixes vulnerabilities: the best overall solve rate is 50%. gpt-5.5 scores highest, but its lead over the smaller OpenAI models is not statistically significant.
  • The OpenAI models are statistically indistinguishable from one another, as are the two Poolside models, but the OpenAI family outperforms the Poolside family.
  • Cost efficiency varies widely: gpt-5.4-mini is the most efficient (100k input tokens), while the Laguna models use 4× more tokens for similar outcomes.
  • Failure modes include wrong-search drift, budget exhaustion mid-implementation, partial fixes, and fixing the wrong part of a vulnerability.
  • The 'Locate' condition is the sharpest test of genuine security reasoning, but models still struggle, indicating gaps in independent vulnerability recognition.
  • Limitations include potential dataset contamination, narrow task scope (Python-only, compact fixes), and underpowered within-family comparisons.
  • Building the benchmark revealed challenges like maintainers patching without tests, complex environment setups, and high inference costs.
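
The "statistically indistinguishable" and "underpowered" points above follow largely from the small task count. As a rough illustration only, the sketch below computes a 95% Wilson score interval for a solve rate. It assumes one pass/fail outcome per CVE (20 trials); the summary does not say how many runs were made or which statistical test the authors used, and the 7-of-20 figure for a second model is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Assumption: 20 tasks, one outcome each. 10/20 matches the reported 50% best
# solve rate; 7/20 is a made-up comparison point, not a reported result.
best_low, best_high = wilson_interval(10, 20)
other_low, other_high = wilson_interval(7, 20)

print(f"10/20 solved: {best_low:.2f} to {best_high:.2f}")
print(f" 7/20 solved: {other_low:.2f} to {other_high:.2f}")

# True when the two intervals share some range of plausible solve rates.
intervals_overlap = other_high >= best_low and best_high >= other_low
print("intervals overlap:", intervals_overlap)
```

Under these assumptions the interval for a 50% solve rate runs from roughly 30% to 70%, so two models several tasks apart can still have heavily overlapping intervals. Overlap is only an informal indication, not a formal significance test, but it shows why a 20-task benchmark struggles to separate models within a family.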