Can You Trust Code Copilots? Evaluating LLMs from a Code Security Perspective
- #Benchmark
- #Code Security
- #LLM
- Proposes CoV-Eval, a multi-task benchmark for evaluating LLM code security across tasks such as code completion, vulnerability repair, vulnerability detection, and vulnerability classification (see the illustrative sketch after this list).
- Introduces VC-Judge, an improved judgment model, aligned with human expert review, that checks LLM-generated programs for vulnerabilities more efficiently and reliably.
- Evaluates 20 proprietary and open-source LLMs, finding that most can identify vulnerable code but struggle to generate secure code and to recognize specific vulnerability types.
- Highlights key challenges and optimization directions for future research in LLM code security through extensive experiments and qualitative analyses.
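To make the vulnerability-repair task concrete, here is a minimal sketch of what such a task could look like: a function with a classic SQL-injection flaw and its repaired counterpart. The snippet is purely illustrative and is not drawn from CoV-Eval itself; function names and the schema are assumptions.

```python
# Illustrative only: a hypothetical vulnerability-repair pair in the spirit of
# CoV-Eval's task description (SQL injection, CWE-89), not an actual benchmark item.
import sqlite3

def get_user_vulnerable(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input is concatenated directly into the SQL string,
    # so an input like "x' OR '1'='1" alters the query's logic.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def get_user_repaired(conn: sqlite3.Connection, username: str):
    # Repaired: a parameterized query lets the driver handle escaping,
    # which is the kind of fix a vulnerability-repair task would expect.
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```

Detection and classification tasks would then ask a model whether the first variant is vulnerable and which CWE it falls under, while secure code completion asks the model to produce the repaired form in the first place.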