Golden Sets: Regression Engineering for Probabilistic Systems
- #AI quality assurance
- #regression testing
- #probabilistic systems
- Golden sets are unit tests for probabilistic behavior, preventing quality regressions.
- They include curated cases, versioned rubrics, and gates to ensure quality.
- Golden sets turn subjective improvements into measurable ones.
- Key components: representative inputs, explicit expectations, rubrics, pinned scoring versions, and acceptance thresholds.
- Golden sets help discover regressions before they reach production.
- Each case in a golden set should include input, constraints, expected outcomes, assertions, and metadata.
- Golden sets are essential for workflows with production consequences.
- Common failure modes: demo-case optimism, metric collapse, change-surface blindness, stale sets, judge drift, and missing negative cases.
- Implementation steps: start with behavior classes, use deterministic assertions, apply rubrics, slice by change surface, and add cases from incidents.
- Golden sets should feed evaluation gates that decide whether a change ships, using multi-metric checks rather than a single aggregate score.
- They are most effective when paired with execution traces, so a failing case can be traced back to the step that regressed.
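The per-case structure listed above (input, constraints, expected outcomes, assertions, metadata) can be sketched as a small schema. This is a minimal illustration; the field names and example values are assumptions, not a standard format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one golden-set case; field names are illustrative.
@dataclass
class GoldenCase:
    case_id: str                                     # stable identifier for diffing runs
    input: str                                       # the prompt or request under test
    constraints: dict = field(default_factory=dict)  # e.g. max length, output format
    expected: dict = field(default_factory=dict)     # expected outcomes / rubric targets
    assertions: list = field(default_factory=list)   # names of deterministic checks to run
    metadata: dict = field(default_factory=dict)     # source incident, change surface, tags

case = GoldenCase(
    case_id="refund-policy-001",
    input="Can I return an opened item after 30 days?",
    constraints={"max_tokens": 200, "format": "plain_text"},
    expected={"must_mention": ["30-day window", "store credit"]},
    assertions=["valid_json", "no_pii"],
    metadata={"origin": "incident-4821", "surface": "retrieval"},
)
```

Keeping `case_id` stable across versions is what lets two evaluation runs be diffed case by case.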
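The "deterministic assertions" step can be sketched as pure functions of the model output, so a given output always yields the same pass/fail result. The specific checks below (JSON validity, length, a toy PII pattern) are assumed examples, not a fixed list.

```python
import json
import re

def is_valid_json(output: str) -> bool:
    """Check the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def within_length(output: str, max_chars: int) -> bool:
    """Check the output respects a length constraint."""
    return len(output) <= max_chars

def contains_no_pii(output: str) -> bool:
    """Toy placeholder: flag email-like strings only."""
    return not re.search(r"[\w.]+@[\w.]+", output)

def run_assertions(output: str, max_chars: int = 500) -> dict:
    # Each check is deterministic, so regressions are reproducible, not flaky.
    return {
        "valid_json": is_valid_json(output),
        "within_length": within_length(output, max_chars),
        "no_pii": contains_no_pii(output),
    }

results = run_assertions('{"answer": "Returns accepted within 30 days."}')
# All three checks pass for this output.
```

Deterministic checks like these catch structural failures cheaply; rubric scoring then covers the qualities they cannot express.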
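A multi-metric shipping gate could look like the sketch below: every metric must clear its own floor, so a strong average cannot mask a collapsed dimension. The metric names and thresholds are assumptions for illustration.

```python
# Hypothetical per-metric floors; values are illustrative, not recommendations.
THRESHOLDS = {
    "accuracy": 0.90,       # pass rate on deterministic assertions
    "rubric_score": 0.80,   # mean score from the pinned rubric version
    "safety": 1.00,         # zero tolerance on safety cases
}

def evaluate_gate(metrics: dict, thresholds: dict = THRESHOLDS) -> tuple:
    """Return (ship_ok, failures), where failures lists every metric
    that fell below its floor with the observed and required values."""
    failures = [
        (name, metrics.get(name, 0.0), floor)
        for name, floor in thresholds.items()
        if metrics.get(name, 0.0) < floor
    ]
    return (not failures, failures)

ok, failures = evaluate_gate(
    {"accuracy": 0.93, "rubric_score": 0.78, "safety": 1.0}
)
# Gate fails: rubric_score 0.78 is below its 0.80 floor despite high accuracy.
```

Returning the list of failing metrics, rather than a bare boolean, gives the shipping decision an explanation for free.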