SIR-Bench: a benchmark for security incident response agents

9 hours ago
  • #Security Incident Response
  • #Benchmark Evaluation
  • #Forensic Investigation
  • SIR-Bench is a benchmark with 794 test cases for evaluating autonomous security incident response agents.
  • It distinguishes genuine forensic investigation from alert parroting using expert-validated ground truth from 129 anonymized incident patterns.
  • The benchmark measures triage accuracy (M1), novel finding discovery (M2), and tool usage appropriateness (M3) through an adversarial LLM-as-Judge.
  • In the authors' evaluation, their SIR agent achieves 97.1% true positive detection and 73.4% false positive rejection, and surfaces an average of 5.67 novel key findings per case.
  • The Once Upon A Threat (OUAT) framework replays real incidents in controlled cloud environments to produce authentic telemetry.
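
To make the headline numbers concrete, here is a minimal sketch of how figures such as a true positive detection rate, a false positive rejection rate, and novel findings per case could be tallied from per-case results. It is not the benchmark's scoring code: `CaseResult`, its fields, and `summarize` are hypothetical names, and averaging novel findings over all cases is an assumption, since the summary does not say how that average is taken.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CaseResult:
    """Hypothetical per-case record; field names are illustrative assumptions."""
    is_true_incident: bool  # ground-truth label: genuine incident vs. benign alert
    agent_flagged: bool     # agent's triage verdict: did it call this an incident?
    novel_findings: int     # findings beyond the alert that a judge accepted


def _rate(hits: int, total: int) -> float:
    """Return hits / total, or NaN when there are no cases of that kind."""
    return hits / total if total else float("nan")


def summarize(results: Iterable[CaseResult]) -> dict:
    """Aggregate triage accuracy and novel-finding counts over a set of cases."""
    results = list(results)
    incidents = [r for r in results if r.is_true_incident]
    benign = [r for r in results if not r.is_true_incident]

    return {
        # Share of genuine incidents the agent correctly flagged.
        "true_positive_detection": _rate(
            sum(r.agent_flagged for r in incidents), len(incidents)
        ),
        # Share of benign alerts the agent correctly dismissed.
        "false_positive_rejection": _rate(
            sum(not r.agent_flagged for r in benign), len(benign)
        ),
        # Assumption: averaged over every case, not only genuine incidents.
        "novel_findings_per_case": _rate(
            sum(r.novel_findings for r in results), len(results)
        ),
    }
```

Calling `summarize` on a list of `CaseResult` records returns the three aggregates as a dictionary. Detection and rejection are reported separately because an agent that simply repeats every alert as an incident would score perfectly on the first and zero on the second.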