N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

8 hours ago

N-Day-Bench measures LLMs' ability to discover real-world vulnerabilities disclosed after their knowledge cut-off.
The benchmark is adaptive, with monthly updates to test cases and model versions.
All traces are publicly accessible, providing transparency into the evaluation process.
Leading models include GPT-5.4, GLM-5.1, and Claude Opus 4.6; GPT-5.4 has the highest average score, 83.93.
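
As a rough illustration of the cut-off rule described above, the sketch below keeps only vulnerabilities disclosed after a model's knowledge cut-off and averages the scores over those cases. It is not taken from the benchmark itself: the `Case` type, its field names, and the functions `eligible_cases` and `average_score` are hypothetical, and the summary does not say how N-Day-Bench actually selects or scores cases.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class Case:
    """Hypothetical test case: one publicly disclosed vulnerability."""
    case_id: str
    disclosed: date


def eligible_cases(cases, knowledge_cutoff):
    """Keep only cases disclosed strictly after the model's knowledge cut-off."""
    return [c for c in cases if c.disclosed > knowledge_cutoff]


def average_score(scores_by_case, cases):
    """Mean score over the given cases; None when no case has a score."""
    scores = [scores_by_case[c.case_id] for c in cases if c.case_id in scores_by_case]
    return mean(scores) if scores else None


if __name__ == "__main__":
    # Made-up data for illustration only; not real benchmark cases or results.
    cases = [
        Case("case-a", date(2025, 1, 10)),
        Case("case-b", date(2025, 6, 1)),
        Case("case-c", date(2025, 9, 15)),
    ]
    kept = eligible_cases(cases, knowledge_cutoff=date(2025, 3, 1))
    print([c.case_id for c in kept])
    print(average_score({"case-a": 10.0, "case-b": 80.0, "case-c": 90.0}, kept))
```

Under this assumption, each model is measured only on vulnerabilities it could not have seen during training, which would explain why both the test cases and the model list need regular updates.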
