The paper introduces the Δ-OPE task: estimating the difference in value between a target recommendation policy and the production policy, rather than estimating each policy's value individually.
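In notation, a sketch of the usual contextual-bandit setup (with π_t, π_p, and π_0 denoting the target, production, and logging policies, consistent with the paper's framing):

```latex
V(\pi) = \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{a \sim \pi(\cdot \mid x)} \left[ r(a, x) \right],
\qquad
\Delta(\pi_t, \pi_p) = V(\pi_t) - V(\pi_p).
```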
The key insight is that when the value estimates for the target and production policies are positively correlated, as they are when both are computed from the same logged data, their difference can often be estimated with lower variance than either value separately. This yields improved statistical power in off-policy evaluation scenarios and better learned recommendation policies in off-policy learning scenarios.
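This rests on the standard variance identity for a difference of estimators; a positive covariance term cancels part of the two individual variances:

```latex
\operatorname{Var}\!\left(\hat{V}(\pi_t) - \hat{V}(\pi_p)\right)
= \operatorname{Var}\!\left(\hat{V}(\pi_t)\right)
+ \operatorname{Var}\!\left(\hat{V}(\pi_p)\right)
- 2\,\operatorname{Cov}\!\left(\hat{V}(\pi_t), \hat{V}(\pi_p)\right).
```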
The authors derive Δ-OPE estimators based on Inverse Propensity Scoring (Δ-IPS), Self-Normalised IPS (Δ-SNIPS), and additive control variates (Δβ-IPS); Δ-IPS and Δβ-IPS are unbiased, and the authors characterize the variance-optimal control-variate coefficient for the Δβ-IPS estimator.
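A minimal NumPy sketch of how these estimators might look, assuming per-event logged importance weights w_t and w_p (the target and production propensity ratios against the logging policy) and rewards r; the function names are illustrative, not from the paper:

```python
import numpy as np

def delta_ips(w_t, w_p, r):
    """Delta-IPS: unbiased estimate of V(pi_t) - V(pi_p) from shared logs,
    where w_t = pi_t(a|x) / pi_0(a|x) and w_p = pi_p(a|x) / pi_0(a|x)."""
    return np.mean((w_t - w_p) * r)

def delta_snips(w_t, w_p, r):
    """Delta-SNIPS: each policy's term is self-normalised by its own
    importance weights (lower variance, consistent but not unbiased)."""
    return np.sum(w_t * r) / np.sum(w_t) - np.sum(w_p * r) / np.sum(w_p)

def delta_beta_ips(w_t, w_p, r):
    """Delta-beta-IPS: Delta-IPS with an additive control variate.
    Importance weights have expectation 1, so d = w_t - w_p has known
    mean 0, and subtracting beta * d keeps the estimate unbiased for any
    fixed beta. The variance-optimal coefficient is
    beta* = Cov(d * r, d) / Var(d), estimated here by plug-in."""
    d = w_t - w_p
    beta = np.cov(d * r, d, ddof=0)[0, 1] / np.var(d)
    return np.mean(d * r - beta * d)
```

Estimating β from the same sample makes the plug-in version only approximately unbiased, but that bias shrinks with sample size and is typically negligible next to the variance reduction.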
Experiments on both simulated and real-world data show that the Δ-OPE estimator family significantly outperforms traditional pointwise OPE methods, with Δβ-IPS consistently performing best.
Key insights extracted from the paper by Olivier Jeun... on arxiv.org, 09-17-2024
https://arxiv.org/pdf/2405.10024.pdf