
Can You Trust Code Copilots? Evaluating LLMs from a Code Security Perspective

a year ago
  • #Benchmark
  • #Code Security
  • #LLM
  • Proposes CoV-Eval, a multi-task benchmark that evaluates LLM code security across code completion, vulnerability repair, vulnerability detection, and vulnerability classification (a minimal evaluation sketch follows this list).
  • Introduces VC-Judge, an improved judgment model that aligns closely with human experts, enabling more efficient and reliable review of LLM-generated programs for vulnerabilities.
  • Evaluates 20 proprietary and open-source LLMs, finding that most can identify vulnerable code but struggle to generate secure code and to recognize specific vulnerability types.
  • Highlights key challenges and optimization directions for future research in LLM code security through extensive experiments and qualitative analyses.
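To make the workflow in the bullets above concrete, here is a minimal sketch of how a CoV-Eval-style multi-task evaluation loop might be wired up. This is an illustration under stated assumptions, not the paper's implementation: `Task`, `query_model`, `judge_is_vulnerable`, and `evaluate` are all hypothetical names, and the actual model calls are left as placeholders.

```python
# Hypothetical sketch of a CoV-Eval-style evaluation loop.
# None of these names come from the paper's released code.
from dataclasses import dataclass

@dataclass
class Task:
    kind: str       # "completion" | "repair" | "detection" | "classification"
    prompt: str     # code context or vulnerable snippet given to the model
    reference: str  # ground-truth label or patched code, where available

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for an API call to the LLM under evaluation."""
    raise NotImplementedError  # e.g., wrap an inference API call here

def judge_is_vulnerable(judge_name: str, program: str) -> bool:
    """Placeholder for a VC-Judge-style model that flags vulnerable code."""
    raise NotImplementedError

def evaluate(model_name: str, judge_name: str, tasks: list[Task]) -> dict:
    """Return the fraction of successful outputs per task kind."""
    passed, total = {}, {}
    for task in tasks:
        output = query_model(model_name, task.prompt)
        total[task.kind] = total.get(task.kind, 0) + 1
        if task.kind in ("completion", "repair"):
            # Generation tasks: the judge model reviews the produced program.
            ok = not judge_is_vulnerable(judge_name, output)
        else:
            # Detection/classification tasks: compare against ground truth.
            ok = output.strip() == task.reference.strip()
        passed[task.kind] = passed.get(task.kind, 0) + int(ok)
    return {kind: passed[kind] / total[kind] for kind in total}
```

The split in the loop mirrors the summary: generative tasks (completion, repair) are scored by a judge model standing in for VC-Judge, while discriminative tasks (detection, classification) are scored against reference answers.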