Can You Trust Code Copilots? Evaluating LLMs from a Code Security Perspective
- #Benchmark
- #Code Security
- #LLM
- Proposes CoV-Eval, a multi-task benchmark for evaluating LLM code security across tasks such as code completion, vulnerability repair, vulnerability detection, and vulnerability classification (see the illustrative sketch after this list).
- Introduces VC-Judge, an improved judgment model, aligned with human expert review, that checks LLM-generated programs for vulnerabilities more efficiently and reliably.
- Evaluates 20 proprietary and open-source LLMs, finding that most can identify vulnerable code but struggle to generate secure code and to recognize specific vulnerability types.
- Highlights key challenges and optimization directions for future research in LLM code security through extensive experiments and qualitative analyses.
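To make the vulnerability-repair task concrete, here is a minimal sketch of what such a task could look like: a function with a classic SQL-injection flaw and its repaired counterpart. The snippet is purely illustrative and is not drawn from CoV-Eval itself; function names and the schema are assumptions.

```python
# Illustrative only: a hypothetical vulnerability-repair pair in the spirit of
# CoV-Eval's task description (SQL injection, CWE-89), not an actual benchmark item.
import sqlite3

def get_user_vulnerable(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input is concatenated directly into the SQL string,
    # so an input like "x' OR '1'='1" alters the query's logic.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def get_user_repaired(conn: sqlite3.Connection, username: str):
    # Repaired: a parameterized query lets the driver handle escaping,
    # which is the kind of fix a vulnerability-repair task would expect.
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```

Detection and classification tasks would then ask a model whether the first variant is vulnerable and which CWE it falls under, while secure code completion asks the model to produce the repaired form in the first place.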