Hasty Briefs
Golden Sets: Regression Engineering for Probabilistic Systems

a day ago
  • #AI quality assurance
  • #regression testing
  • #probabilistic systems
  • Golden sets are unit tests for probabilistic behavior, preventing quality regressions.
  • They combine curated cases, versioned rubrics, and shipping gates to guard output quality.
  • Golden sets turn subjective improvements into measurable ones.
  • Key components: representative inputs, explicit expectations, rubrics, pinned scoring versions, and acceptance thresholds.
  • Golden sets surface regressions before they reach production.
  • Each case in a golden set should include input, constraints, expected outcomes, assertions, and metadata.
  • Golden sets are essential for workflows with production consequences.
  • Common failure modes: demo-case optimism, metric collapse, change-surface blindness, stale sets, judge drift, and missing negative cases.
  • Implementation steps: start with behavior classes, use deterministic assertions, apply rubrics, slice by change surface, and add cases from incidents.
  • Golden sets should feed into evaluation gates for shipping decisions, using multi-metric checks rather than a single aggregate score.
  • They are most effective when paired with traces to explain regressions.
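The case anatomy described above (input, constraints, expected outcomes, assertions, metadata) can be sketched as a small data structure. This is a minimal illustration, not the article's format: all field names and the example values are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical shape of one golden-set case; field names are
# illustrative, not prescribed by the article.
@dataclass
class GoldenCase:
    case_id: str          # stable identifier for tracking across runs
    input: str            # the prompt or request under test
    constraints: dict     # e.g. required format, length limits
    expected: dict        # explicit expectations for the output
    assertions: list      # names of deterministic checks to run
    metadata: dict = field(default_factory=dict)  # owner, incident link, slice tags

case = GoldenCase(
    case_id="refund-policy-001",
    input="What is the refund window for annual plans?",
    constraints={"format": "plain_text", "max_words": 80},
    expected={"must_mention": ["30 days"], "must_not_mention": ["lifetime"]},
    assertions=["contains_required_terms", "within_word_limit"],
    metadata={"source": "incident-2319", "surface": "billing"},
)
```

Keeping metadata such as the originating incident on each case supports the advice to grow the set from real failures rather than demo cases.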
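The deterministic assertions mentioned in the implementation steps might look like the following sketch, assuming a plain-string model output; the check names mirror the hypothetical case format above but are likewise illustrative.

```python
# Minimal deterministic checks over a model output string.
# These are assumptions for illustration, not the article's API.
def contains_required_terms(output: str, required: list) -> bool:
    # Pass only if every required term appears verbatim.
    return all(term in output for term in required)

def within_word_limit(output: str, max_words: int) -> bool:
    # Pass only if the output respects the word budget.
    return len(output.split()) <= max_words

output = "Annual plans can be refunded within 30 days of purchase."
results = {
    "contains_required_terms": contains_required_terms(output, ["30 days"]),
    "within_word_limit": within_word_limit(output, 80),
}
```

Deterministic checks like these run identically on every evaluation, so a failure points at a behavior change rather than judge drift.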
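A multi-metric shipping gate, as opposed to a single collapsed score, can be sketched as per-metric floors that must all hold; the metric names and thresholds here are invented for illustration.

```python
# Hypothetical per-metric acceptance thresholds; each metric must
# independently clear its floor, so a single aggregate cannot hide
# a regression in one dimension (the "metric collapse" failure mode).
THRESHOLDS = {"accuracy": 0.95, "format_compliance": 0.99, "safety": 1.0}

def gate(scores: dict) -> tuple:
    # Return (ship_ok, list of metrics that fell below their floor).
    failures = [m for m, floor in THRESHOLDS.items()
                if scores.get(m, 0.0) < floor]
    return (not failures, failures)

ok, failed = gate({"accuracy": 0.97, "format_compliance": 0.98, "safety": 1.0})
# format_compliance misses its floor, so the gate blocks the release.
```

Pairing the failing metric list with per-case traces is what lets a blocked release be explained rather than merely detected.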