Counterfactual evaluation for recommendation systems
- #counterfactual evaluation
- #recommendation systems
- #machine learning
- Offline evaluation of recommendation systems treats recommendation as an observational problem, when it is actually interventional: deploying a new model changes what users see and therefore how they behave.
- Traditional metrics like recall, precision, and NDCG evaluate how well recommendations fit logged data, not their actual impact on user behavior.
- A/B testing is a direct but resource-intensive method for evaluating recommendations as interventional problems.
- Counterfactual evaluation, particularly Inverse Propensity Scoring (IPS), estimates the outcomes of potential A/B tests without running them.
- IPS reweights each logged reward by the importance weight: the ratio of the new model's probability of recommending the logged item to the old (logging) model's probability (see the sketch after this list).
- Challenges with IPS include insufficient support (the new model recommends items the logging model never showed, so their logged propensity is zero) and high variance when the importance weights grow large because the two models' recommendation probabilities differ greatly.
- Clipped IPS (CIPS) and Self-Normalized IPS (SNIPS) mitigate this variance: CIPS caps the importance weights, while SNIPS normalizes by their sum, with SNIPS performing best in experiments (see the second sketch below).
- Because its normalizer is the sum of importance weights, SNIPS requires computing the weights for all observations, increasing storage and computation but offering faster convergence.
- Despite its limitations, observational evaluation remains useful for its established framework and ease of data collection.
- Counterfactual evaluation via SNIPS is recommended when offline metrics diverge from online A/B testing outcomes or for simulating A/B tests offline.
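- A minimal numpy sketch of the IPS estimator described above, assuming per-recommendation propensities were logged; the function and array names are illustrative, not from the original note.

```python
import numpy as np

def ips_estimate(rewards, logging_propensity, target_propensity):
    """Estimate the new model's average reward from logs collected under
    the old model, reweighting each logged reward by the importance
    weight w_i = P(item_i | new model) / P(item_i | old model)."""
    weights = target_propensity / logging_propensity
    return np.mean(weights * rewards)

# Toy usage: three logged recommendations, clicks as rewards.
rewards = np.array([1.0, 0.0, 1.0])
logging_propensity = np.array([0.5, 0.2, 0.1])  # P(item | old model)
target_propensity = np.array([0.4, 0.1, 0.3])   # P(item | new model)
print(ips_estimate(rewards, logging_propensity, target_propensity))
```

- Note how the third weight (0.3 / 0.1 = 3.0) pushes the estimate above any observed reward; this is exactly the variance problem the bullets above describe.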
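- Sketches of the two variance fixes, under the same assumptions as above; the clipping threshold is a hypothetical hyperparameter, not a value from the note.

```python
import numpy as np

def cips_estimate(rewards, logging_propensity, target_propensity, clip=10.0):
    """Clipped IPS: cap each importance weight at `clip`, trading a
    little bias for bounded variance."""
    weights = np.minimum(target_propensity / logging_propensity, clip)
    return np.mean(weights * rewards)

def snips_estimate(rewards, logging_propensity, target_propensity):
    """Self-Normalized IPS: divide by the sum of importance weights
    instead of the sample count, which keeps the estimate within the
    range of observed rewards and lowers variance."""
    weights = target_propensity / logging_propensity
    return np.sum(weights * rewards) / np.sum(weights)
```

- The SNIPS normalizer is why every observation's weight must be computed and kept around, matching the storage and computation cost noted above.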