The paper presents a new doubly-robust off-policy evaluation (OPE) estimator called DRUnknown for Markov decision processes. The key highlights are:
- DRUnknown targets the setting where both the logging policy and the value function are unknown; it estimates the logging policy model and the value function model simultaneously.
- When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance among existing OPE estimators that use an estimated logging policy.
- When the value function model is also correctly specified, DRUnknown is optimal: its asymptotic variance attains the semiparametric lower bound.
- The authors derive the influence function of the proposed estimator and use it to estimate the parameters that minimize the asymptotic variance.
- Experiments on contextual bandits and reinforcement learning problems show that DRUnknown consistently outperforms existing methods in mean squared error.
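For intuition, below is a minimal sketch of a standard doubly-robust OPE estimate for a contextual bandit in which both the logging policy (propensities) and the reward model are fit from the logged data, which is the setting DRUnknown addresses. This is the generic DR form, not the paper's DRUnknown estimator; the function name `dr_ope_estimate` and the choice of logistic/ridge models are illustrative assumptions.

```python
# A minimal sketch of a generic doubly-robust (DR) off-policy value estimate for a
# contextual bandit, assuming the logging policy and the reward model are both
# estimated from logged data. Illustrative only; not the paper's DRUnknown estimator.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

def dr_ope_estimate(X, actions, rewards, target_policy_probs, n_actions):
    """Doubly-robust estimate of a target policy's value from logged bandit data.

    X                   : (n, d) contexts
    actions             : (n,) logged actions in {0, ..., n_actions-1}
    rewards             : (n,) observed rewards
    target_policy_probs : (n, n_actions) target policy probabilities pi(a | x)

    Assumes every action appears at least once in the log.
    """
    n = X.shape[0]

    # 1) Estimate the unknown logging policy with a multinomial logistic model.
    prop_model = LogisticRegression(max_iter=1000).fit(X, actions)
    mu_hat = prop_model.predict_proba(X)                  # estimated mu(a | x)
    mu_taken = mu_hat[np.arange(n), actions]              # mu_hat(a_i | x_i)

    # 2) Estimate the reward (value) model per action with ridge regression.
    q_hat = np.zeros((n, n_actions))
    for a in range(n_actions):
        mask = actions == a
        q_hat[:, a] = Ridge().fit(X[mask], rewards[mask]).predict(X)

    # 3) Combine: direct-method term plus importance-weighted correction.
    dm_term = np.sum(target_policy_probs * q_hat, axis=1)   # E_pi[ Q_hat(x, a) ]
    weights = target_policy_probs[np.arange(n), actions] / mu_taken
    correction = weights * (rewards - q_hat[np.arange(n), actions])
    return np.mean(dm_term + correction)
```

The key property of this form is double robustness: the estimate stays consistent if either the propensity model or the reward model is correctly specified, and the correction term removes the bias of the direct-method term when the propensities are right.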