N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?
8 hours ago
- #vulnerability discovery
- #cybersecurity
- #benchmark
- N-Day-Bench measures how well LLMs can discover real-world vulnerabilities that were disclosed after each model's knowledge cutoff.
- The benchmark is adaptive, with monthly updates to test cases and model versions.
- All traces are publicly accessible, providing transparency into the evaluation process.
- Leading models include GPT-5.4, GLM-5.1, and Claude Opus-4.6, with GPT-5.4 achieving the highest average score of 83.93.
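
The idea of testing only on vulnerabilities disclosed after a model's knowledge cutoff can be illustrated with a small date filter. This is a minimal sketch, not the benchmark's actual code: the `Case` type, the `eligible_cases` helper, the case IDs, and all dates are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    """Hypothetical benchmark case: one disclosed vulnerability."""
    case_id: str
    disclosed: date


def eligible_cases(cases, knowledge_cutoff):
    """Keep only cases disclosed strictly after the model's knowledge cutoff."""
    return [c for c in cases if c.disclosed > knowledge_cutoff]


# Made-up example data, not taken from the benchmark.
cases = [
    Case("case-a", date(2025, 1, 10)),
    Case("case-b", date(2025, 6, 2)),
    Case("case-c", date(2025, 9, 30)),
]
cutoff = date(2025, 5, 31)  # assumed cutoff for an imaginary model

selected = eligible_cases(cases, cutoff)
print([c.case_id for c in selected])  # ['case-b', 'case-c']
```

Under these assumptions, a case disclosed before the cutoff (`case-a`) is excluded because the model could have seen it during training, which is the contamination risk this kind of benchmark is designed to avoid.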