SIR-Bench – a benchmark for security incident response agents

9 hours ago

SIR-Bench is a benchmark with 794 test cases for evaluating autonomous security incident response agents.
It distinguishes genuine forensic investigation from alert parroting using expert-validated ground truth from 129 anonymized incident patterns.
The benchmark measures triage accuracy (M1), novel finding discovery (M2), and tool usage appropriateness (M3) through an adversarial LLM-as-Judge.
The benchmark authors' evaluation of their own SIR agent shows 97.1% true positive detection, 73.4% false positive rejection, and an average of 5.67 novel key findings per case.
The Once Upon A Threat (OUAT) framework replays real incidents in controlled cloud environments to produce authentic telemetry.
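
To make the reported figures concrete, here is a minimal sketch of how rates like these could be computed from per-case results. It is not SIR-Bench's scoring code: the summary does not give the exact metric definitions, so the `CaseResult` record, its field names, and the choice to average novel findings over all cases are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case record; field names are illustrative only."""
    is_true_positive: bool  # ground truth: the alert reflects a real incident
    agent_flagged: bool     # agent verdict: the agent treated it as an incident
    novel_findings: int     # findings beyond the alert text (judge-assigned)


def summarize(results):
    """Compute detection, rejection, and mean novel-finding figures."""
    tp_cases = [r for r in results if r.is_true_positive]
    fp_cases = [r for r in results if not r.is_true_positive]

    # Share of real incidents the agent correctly flagged.
    tp_detection = (
        sum(r.agent_flagged for r in tp_cases) / len(tp_cases)
        if tp_cases else math.nan
    )
    # Share of benign alerts the agent correctly dismissed.
    fp_rejection = (
        sum(not r.agent_flagged for r in fp_cases) / len(fp_cases)
        if fp_cases else math.nan
    )
    # Assumption: averaged over every case, not only true positives.
    mean_novel = (
        sum(r.novel_findings for r in results) / len(results)
        if results else math.nan
    )
    return {
        "tp_detection": tp_detection,
        "fp_rejection": fp_rejection,
        "mean_novel_findings": mean_novel,
    }


if __name__ == "__main__":
    demo = [
        CaseResult(True, True, 6),
        CaseResult(True, False, 4),
        CaseResult(False, False, 0),
        CaseResult(False, True, 2),
    ]
    print(summarize(demo))
```

Reporting detection and rejection separately matters because an agent that flags every alert would reach 100% true positive detection while rejecting no false positives.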
