Hasty Briefs (beta)

Counterfactual evaluation for recommendation systems

4 months ago
  • #counterfactual evaluation
  • #recommendation systems
  • #machine learning
  • Offline evaluation of recommendation systems is usually framed as an observational problem, but recommendation is actually an intervention: serving an item changes what users see and do.
  • Traditional metrics like recall, precision, and NDCG evaluate how well recommendations fit logged data, not their actual impact on user behavior.
  • A/B testing is a direct but resource-intensive method for evaluating recommendations as interventional problems.
  • Counterfactual evaluation, particularly Inverse Propensity Scoring (IPS), estimates the outcomes of potential A/B tests without running them.
  • IPS reweights each logged reward by the ratio of the new model's probability of recommending the item to the logging model's probability of having recommended it.
  • Challenges with IPS include insufficient support (the logging policy assigns some items zero probability, so the new model's behavior on them cannot be estimated) and high variance when the two models' recommendation probabilities differ greatly.
  • Clipped IPS (CIPS) and Self-Normalized IPS (SNIPS) mitigate the high variance of plain IPS, with SNIPS performing best in the post's experiments.
  • SNIPS requires computing importance weights for all observations, increasing storage and computation but offering faster convergence.
  • Despite its limitations, observational evaluation remains useful for its established framework and ease of data collection.
  • Counterfactual evaluation via SNIPS is recommended when offline metrics diverge from online A/B testing outcomes or for simulating A/B tests offline.
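The IPS, CIPS, and SNIPS estimators summarized above can be sketched as follows. This is a minimal illustration, not the post's implementation: the function names, the clip default, and the toy propensity values are assumptions.

```python
import numpy as np

# Each logged observation carries a reward, the logging policy's
# propensity p_log for the recommended item, and the new model's
# probability p_new of recommending that same item.

def ips(rewards, p_new, p_log):
    """Plain Inverse Propensity Scoring: reweight logged rewards by
    how much more (or less) often the new model would recommend each item."""
    w = np.asarray(p_new) / np.asarray(p_log)  # importance weights
    return np.mean(w * np.asarray(rewards))

def cips(rewards, p_new, p_log, clip=10.0):
    """Clipped IPS: cap the importance weights to bound variance,
    at the cost of introducing some bias."""
    w = np.minimum(np.asarray(p_new) / np.asarray(p_log), clip)
    return np.mean(w * np.asarray(rewards))

def snips(rewards, p_new, p_log):
    """Self-Normalized IPS: divide by the sum of the weights instead of n,
    which keeps the estimate on the reward scale and reduces variance.
    Needs the weights of all observations, hence the extra storage and
    computation noted above."""
    w = np.asarray(p_new) / np.asarray(p_log)
    return np.sum(w * np.asarray(rewards)) / np.sum(w)
```

Note that an item the logging policy never recommended (`p_log = 0`) makes the weight undefined here; that is exactly the insufficient-support problem the post describes.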