Golden Sets: Regression Engineering for Probabilistic Systems
- #AI quality assurance
- #regression testing
- #probabilistic systems
- Golden sets are unit tests for probabilistic behavior, preventing quality regressions.
- They include curated cases, versioned rubrics, and gates to ensure quality.
- Golden sets turn subjective improvements into measurable ones.
- Key components: representative inputs, explicit expectations, rubrics, pinned scoring versions, and acceptance thresholds.
- Golden sets help discover regressions before they reach production.
- Each case in a golden set should include input, constraints, expected outcomes, assertions, and metadata.
- Golden sets are essential for workflows with production consequences.
- Common failure modes: demo-case optimism, metric collapse, change-surface blindness, stale sets, judge drift, and missing negative cases.
- Implementation steps: start with behavior classes, use deterministic assertions, apply rubrics, slice by change surface, and add cases from incidents.
- Golden sets should feed evaluation gates that decide whether a change ships, using multi-metric checks rather than a single aggregate score.
- They are most effective when paired with execution traces, so a failing case can be traced back to the step that regressed.
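The per-case structure listed above (input, constraints, expected outcomes, assertions, metadata) can be sketched as a small schema. This is a minimal illustration; the field names and example values are assumptions, not a standard format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one golden-set case; field names are illustrative.
@dataclass
class GoldenCase:
    case_id: str                                     # stable identifier for diffing runs
    input: str                                       # the prompt or request under test
    constraints: dict = field(default_factory=dict)  # e.g. max length, output format
    expected: dict = field(default_factory=dict)     # expected outcomes / rubric targets
    assertions: list = field(default_factory=list)   # names of deterministic checks to run
    metadata: dict = field(default_factory=dict)     # source incident, change surface, tags

case = GoldenCase(
    case_id="refund-policy-001",
    input="Can I return an opened item after 30 days?",
    constraints={"max_tokens": 200, "format": "plain_text"},
    expected={"must_mention": ["30-day window", "store credit"]},
    assertions=["valid_json", "no_pii"],
    metadata={"origin": "incident-4821", "surface": "retrieval"},
)
```

Keeping `case_id` stable across versions is what lets two evaluation runs be diffed case by case.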
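The "deterministic assertions" step can be sketched as pure functions of the model output, so a given output always yields the same pass/fail result. The specific checks below (JSON validity, length, a toy PII pattern) are assumed examples, not a fixed list.

```python
import json
import re

def is_valid_json(output: str) -> bool:
    """Check the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def within_length(output: str, max_chars: int) -> bool:
    """Check the output respects a length constraint."""
    return len(output) <= max_chars

def contains_no_pii(output: str) -> bool:
    """Toy placeholder: flag email-like strings only."""
    return not re.search(r"[\w.]+@[\w.]+", output)

def run_assertions(output: str, max_chars: int = 500) -> dict:
    # Each check is deterministic, so regressions are reproducible, not flaky.
    return {
        "valid_json": is_valid_json(output),
        "within_length": within_length(output, max_chars),
        "no_pii": contains_no_pii(output),
    }

results = run_assertions('{"answer": "Returns accepted within 30 days."}')
# All three checks pass for this output.
```

Deterministic checks like these catch structural failures cheaply; rubric scoring then covers the qualities they cannot express.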
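A multi-metric shipping gate could look like the sketch below: every metric must clear its own floor, so a strong average cannot mask a collapsed dimension. The metric names and thresholds are assumptions for illustration.

```python
# Hypothetical per-metric floors; values are illustrative, not recommendations.
THRESHOLDS = {
    "accuracy": 0.90,       # pass rate on deterministic assertions
    "rubric_score": 0.80,   # mean score from the pinned rubric version
    "safety": 1.00,         # zero tolerance on safety cases
}

def evaluate_gate(metrics: dict, thresholds: dict = THRESHOLDS) -> tuple:
    """Return (ship_ok, failures), where failures lists every metric
    that fell below its floor with the observed and required values."""
    failures = [
        (name, metrics.get(name, 0.0), floor)
        for name, floor in thresholds.items()
        if metrics.get(name, 0.0) < floor
    ]
    return (not failures, failures)

ok, failures = evaluate_gate(
    {"accuracy": 0.93, "rubric_score": 0.78, "safety": 1.0}
)
# Gate fails: rubric_score 0.78 is below its 0.80 floor despite high accuracy.
```

Returning the list of failing metrics, rather than a bare boolean, gives the shipping decision an explanation for free.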