N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

8 hours ago
  • #vulnerability discovery
  • #cybersecurity
  • #benchmark
  • N-Day-Bench measures whether LLMs can discover real-world vulnerabilities that were disclosed after each model's knowledge cutoff.
  • The benchmark is adaptive, with monthly updates to test cases and model versions.
  • All traces are publicly accessible, providing transparency into the evaluation process.
  • Leading models include GPT-5.4, GLM-5.1, and Claude Opus 4.6; GPT-5.4 has the highest average score, 83.93.
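
The summary's central rule, testing only on vulnerabilities disclosed after a model's knowledge cutoff, can be sketched as a date filter. This is a minimal illustration, not the benchmark's actual code: the `Case` type, the `eligible_cases` helper, the case identifiers, and all dates are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    case_id: str      # hypothetical identifier, not a real advisory
    disclosed: date   # public disclosure date of the vulnerability


def eligible_cases(cases, knowledge_cutoff):
    """Keep only cases disclosed strictly after the model's knowledge cutoff."""
    return [c for c in cases if c.disclosed > knowledge_cutoff]


# Placeholder data for illustration only; none of it comes from N-Day-Bench.
cases = [
    Case("case-a", date(2025, 1, 10)),
    Case("case-b", date(2025, 6, 2)),
    Case("case-c", date(2025, 9, 15)),
]
cutoff = date(2025, 3, 1)  # assumed cutoff for a hypothetical model

selected = eligible_cases(cases, cutoff)
print([c.case_id for c in selected])
```

Because each model has its own cutoff, a filter like this would yield a different eligible set per model, which is one plausible reason a benchmark of this kind needs regular refreshes of its test cases.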