SIR-Bench – benchmark for security incident response agents
- #Security Incident Response
- #Benchmark Evaluation
- #Forensic Investigation
- SIR-Bench is a benchmark with 794 test cases for evaluating autonomous security incident response agents.
- It distinguishes genuine forensic investigation from alert parroting using expert-validated ground truth from 129 anonymized incident patterns.
- The benchmark measures triage accuracy (M1), novel finding discovery (M2), and tool usage appropriateness (M3) through an adversarial LLM-as-Judge.
- The authors' evaluation of their own SIR agent reports 97.1% true positive detection, 73.4% false positive rejection, and an average of 5.67 novel key findings per case.
- The Once Upon A Threat (OUAT) framework replays real incidents in controlled cloud environments to produce authentic telemetry.
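
To make the headline numbers concrete, here is a minimal sketch of how rates like these could be tallied from per-case results. It is not the benchmark's actual scoring code: the `CaseResult` record, its field names, and the `summarize` helper are all hypothetical, and it assumes that true positive detection is measured over genuinely malicious cases, false positive rejection over benign ones, and novel findings averaged over all cases. In SIR-Bench itself, the count of novel findings would come from the LLM-as-Judge rather than a plain integer field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    """Hypothetical per-case record; not the benchmark's real schema."""
    is_real_incident: bool   # ground truth: True = true positive alert, False = false positive alert
    agent_escalated: bool    # agent verdict: True = treated the alert as a real incident
    novel_findings: int      # key findings beyond the alert text (assumed to come from a judge)


def rate(hits: int, total: int) -> Optional[float]:
    """Return hits / total, or None when there are no cases to score."""
    return hits / total if total else None


def summarize(results: list[CaseResult]) -> dict:
    real = [r for r in results if r.is_real_incident]
    benign = [r for r in results if not r.is_real_incident]
    return {
        # Share of real incidents the agent correctly escalated.
        "tp_detection": rate(sum(r.agent_escalated for r in real), len(real)),
        # Share of false-positive alerts the agent correctly dismissed.
        "fp_rejection": rate(sum(not r.agent_escalated for r in benign), len(benign)),
        # Mean novel findings per case (assumption: averaged over all cases).
        "novel_per_case": rate(sum(r.novel_findings for r in results), len(results)),
    }


# Toy data for illustration only; these are not benchmark results.
demo = [
    CaseResult(is_real_incident=True, agent_escalated=True, novel_findings=3),
    CaseResult(is_real_incident=True, agent_escalated=False, novel_findings=1),
    CaseResult(is_real_incident=False, agent_escalated=False, novel_findings=0),
    CaseResult(is_real_incident=False, agent_escalated=True, novel_findings=2),
]
summary = summarize(demo)
```

Reporting the two rates separately, as the summary does, exposes the gap between catching real incidents and dismissing benign alerts, which a single accuracy figure would hide.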