Counterfactual evaluation for recommendation systems
- #counterfactual evaluation
- #recommendation systems
- #machine learning
- Offline evaluation of recommendation systems treats recommendation as an observational problem, when it is actually interventional: deploying a new model changes what users see and therefore how they behave.
- Traditional metrics like recall, precision, and NDCG evaluate how well recommendations fit logged data, not their actual impact on user behavior.
- A/B testing is a direct but resource-intensive method for evaluating recommendations as interventional problems.
- Counterfactual evaluation, particularly Inverse Propensity Scoring (IPS), estimates the outcomes of potential A/B tests without running them.
- IPS reweights each logged reward by the importance weight: the ratio of the new model's probability of recommending the logged item to the old (logging) model's probability (see the sketch after this list).
- Challenges with IPS include insufficient support (the new model recommends items the logging model never showed, so their logged propensity is zero) and high variance when the importance weights grow large because the two models' recommendation probabilities differ greatly.
- Clipped IPS (CIPS) and Self-Normalized IPS (SNIPS) mitigate this variance: CIPS caps the importance weights, while SNIPS normalizes by their sum, with SNIPS performing best in experiments (see the second sketch below).
- Because its normalizer is the sum of importance weights, SNIPS requires computing the weights for all observations, increasing storage and computation but offering faster convergence.
- Despite its limitations, observational evaluation remains useful for its established framework and ease of data collection.
- Counterfactual evaluation via SNIPS is recommended when offline metrics diverge from online A/B testing outcomes or for simulating A/B tests offline.
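- A minimal numpy sketch of the IPS estimator described above, assuming per-recommendation propensities were logged; the function and array names are illustrative, not from the original note.

```python
import numpy as np

def ips_estimate(rewards, logging_propensity, target_propensity):
    """Estimate the new model's average reward from logs collected under
    the old model, reweighting each logged reward by the importance
    weight w_i = P(item_i | new model) / P(item_i | old model)."""
    weights = target_propensity / logging_propensity
    return np.mean(weights * rewards)

# Toy usage: three logged recommendations, clicks as rewards.
rewards = np.array([1.0, 0.0, 1.0])
logging_propensity = np.array([0.5, 0.2, 0.1])  # P(item | old model)
target_propensity = np.array([0.4, 0.1, 0.3])   # P(item | new model)
print(ips_estimate(rewards, logging_propensity, target_propensity))
```

- Note how the third weight (0.3 / 0.1 = 3.0) pushes the estimate above any observed reward; this is exactly the variance problem the bullets above describe.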
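- Sketches of the two variance fixes, under the same assumptions as above; the clipping threshold is a hypothetical hyperparameter, not a value from the note.

```python
import numpy as np

def cips_estimate(rewards, logging_propensity, target_propensity, clip=10.0):
    """Clipped IPS: cap each importance weight at `clip`, trading a
    little bias for bounded variance."""
    weights = np.minimum(target_propensity / logging_propensity, clip)
    return np.mean(weights * rewards)

def snips_estimate(rewards, logging_propensity, target_propensity):
    """Self-Normalized IPS: divide by the sum of importance weights
    instead of the sample count, which keeps the estimate within the
    range of observed rewards and lowers variance."""
    weights = target_propensity / logging_propensity
    return np.sum(weights * rewards) / np.sum(weights)
```

- The SNIPS normalizer is why every observation's weight must be computed and kept around, matching the storage and computation cost noted above.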