The paper introduces the Δ-OPE task, which focuses on estimating the difference in value between a target recommendation policy and the production policy, rather than estimating each policy's value individually.
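Concretely, writing V(π) for the expected reward when recommendations are drawn from policy π, the estimand is the pairwise difference rather than either pointwise value (the notation below paraphrases the paper's setup):

$$\Delta(\pi_t, \pi_p) \;=\; V(\pi_t) - V(\pi_p), \qquad V(\pi) = \mathbb{E}_{x \sim p(x),\; a \sim \pi(\cdot \mid x)}\big[r(x, a)\big].$$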
The key insight is that the two policies' value estimators, computed from the same logged data, are typically positively correlated; whenever that covariance is positive, their difference can be estimated with lower variance than either value separately. This leads to improved statistical power in off-policy evaluation scenarios and better learned recommendation policies in off-policy learning scenarios.
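The variance argument follows from both values being estimated from the same logged interactions, so the two estimators share randomness; the decomposition below is a standard identity rather than a quote from the paper:

$$\operatorname{Var}\!\big(\hat{V}(\pi_t) - \hat{V}(\pi_p)\big) = \operatorname{Var}\!\big(\hat{V}(\pi_t)\big) + \operatorname{Var}\!\big(\hat{V}(\pi_p)\big) - 2\operatorname{Cov}\!\big(\hat{V}(\pi_t), \hat{V}(\pi_p)\big).$$

Whenever the covariance term is positive, as it tends to be when the target and production policies place probability mass on similar actions, the difference is estimated more precisely than either value on its own.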
The authors derive Δ-OPE estimators based on Inverse Propensity Scoring (IPS), Self-Normalised IPS (SNIPS), and additive control variates (Δβ-IPS), and characterize the variance-optimal additive control variate for the Δβ-IPS estimator.
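A minimal sketch of how such estimators can be computed from logged bandit feedback is given below. It assumes per-interaction rewards and action propensities under the logging, target, and production policies; the function names (delta_ips, delta_snips, delta_beta_ips) and the plug-in estimate of the optimal β (the standard control-variate optimum, Cov(rΔw, Δw)/Var(Δw)) are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def delta_ips(rewards, p_log, p_target, p_prod):
    """Δ-IPS: estimate of V(pi_target) - V(pi_prod) from logged data.

    Each logged reward is reused for both policies via importance weights
    w = pi(a|x) / pi_log(a|x), so the two value estimates are positively
    correlated and much of their noise cancels in the difference."""
    w_t = p_target / p_log
    w_p = p_prod / p_log
    return np.mean(rewards * (w_t - w_p))

def delta_snips(rewards, p_log, p_target, p_prod):
    """Δ-SNIPS: self-normalised variant; each policy's term is divided by
    the sum of its own importance weights (consistent rather than unbiased)."""
    w_t = p_target / p_log
    w_p = p_prod / p_log
    return np.sum(rewards * w_t) / np.sum(w_t) - np.sum(rewards * w_p) / np.sum(w_p)

def delta_beta_ips(rewards, p_log, p_target, p_prod):
    """Δβ-IPS sketch: shift rewards by an additive control variate beta.

    Because E[w_t - w_p] = 0 under the logging policy, subtracting any fixed
    beta leaves the estimator's expectation unchanged and only affects its
    variance. Here beta is the standard variance-minimising choice, estimated
    by plug-in (which introduces a small finite-sample bias)."""
    w_t = p_target / p_log
    w_p = p_prod / p_log
    dw = w_t - w_p                                    # mean-zero control variate
    beta = np.cov(rewards * dw, dw, ddof=1)[0, 1] / np.var(dw, ddof=1)
    return np.mean((rewards - beta) * dw)
```

Since the weight difference Δw has zero mean under the logging policy, any fixed β leaves the estimator's expectation untouched; the characterisation in the paper concerns which β minimises the resulting variance.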
Experiments on both simulated and real-world data demonstrate that the Δ-OPE estimator family significantly improves performance compared to traditional pointwise OPE methods, with Δβ-IPS consistently performing the best.
Key insights distilled from Olivier Jeun... on arxiv.org, 09-17-2024: https://arxiv.org/pdf/2405.10024.pdf