
Reducing Variance in Off-Policy Evaluation with State-Based Importance Sampling


Core Concepts
State-based importance sampling reduces the variance of off-policy evaluation by selectively dropping the action probability ratios of certain states from the importance weight computation.
Abstract
The paper proposes state-based importance sampling (SIS), a class of off-policy evaluation techniques that reduce the variance of importance sampling by eliminating states that do not affect the return from the importance weight computation. The key contributions are:
- Introduction of state-based importance sampling estimators that drop "negligible states" from the importance weight computation.
- Two methods to identify negligible states: one based on covariance testing and one based on state-action values.
- Implementation of state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation.
- Empirical experiments demonstrating the performance of these state-based estimators compared to their traditional counterparts in four domains: deterministic lift, stochastic lift, inventory management, and taxi.
The experiments show that state-based estimators consistently yield reduced variance and improved accuracy compared to their traditional counterparts, especially in domains with known "lift states" where certain states have no impact on the return.
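To make the core idea concrete, here is a minimal sketch (not the paper's implementation) contrasting an ordinary importance weight with a state-based one that skips the action probability ratios of states in a designated negligible set. The trajectory format, the policy objects with a `prob(action, state)` method, and the `negligible_states` set are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of ordinary vs. state-based importance weights.
# Assumptions: a trajectory is a list of (state, action) pairs, and each policy
# exposes a prob(action, state) method returning the action probability.

def importance_weight(trajectory, pi_e, pi_b):
    """Ordinary IS: product of action probability ratios over every time step."""
    w = 1.0
    for state, action in trajectory:
        w *= pi_e.prob(action, state) / pi_b.prob(action, state)
    return w

def state_based_importance_weight(trajectory, pi_e, pi_b, negligible_states):
    """State-based IS: ratios of states in the negligible set S_A are dropped."""
    w = 1.0
    for state, action in trajectory:
        if state in negligible_states:
            continue  # this state does not affect the return, so skip its ratio
        w *= pi_e.prob(action, state) / pi_b.prob(action, state)
    return w
```

In either case the value estimate is the average of weight times return over the behavior-policy trajectories; the only difference is which per-step ratios enter the weight.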
Stats
The paper presents the following key figures and metrics:
- The variance upper bound of ordinary importance sampling is exponential in the horizon H: Var(Ĝ_IS) = O(exp(H)).
- The variance upper bound of state-based importance sampling is exponential in M_B, the maximal number of occurrences of states in the non-negligible set S_B, rather than in the full horizon H: Var(Ĝ_SIS) = O(exp(M_B)).
- In the deterministic lift domain, SIS has the best performance across all domain sizes, followed by IS and SPDIS.
- In the stochastic lift domain, WSIS and WSPDIS consistently outperform their traditional counterparts. For large domain sizes, WDRSIS is the best-performing estimator.
- In the inventory management domain, state-based estimators yield 2- to 8-fold improvements in normalized mean squared error compared to their traditional counterparts.
- In the taxi domain, state-based estimators also show a near-universal benefit, with SSDRE being the highest performer for effective horizons between 50 and 1,000.
Quotes
"State-based importance sampling (SIS) mitigates the above issue by constructing an estimator ˆGSIS that selectively drops the action probability ratios of a select state set SA ⊂S from the product to compute the importance weight." "Any such set SA is called an ϵ-negligible state set for the off-policy evaluation problem ⟨M, πe, πb⟩."

Deeper Inquiries

How can the state-based importance sampling framework be extended to handle continuous state spaces or high-dimensional state representations?

To extend the state-based importance sampling framework to continuous state spaces or high-dimensional state representations, function approximation techniques such as neural networks or kernel methods can be used. Representing states with a continuous function approximator allows the per-state action probability ratios to be estimated from the states' continuous representations. The notion of negligible states can then be generalized to continuous spaces by defining a threshold or distance metric that determines which regions of the state space can be treated as negligible. Additionally, techniques such as kernel density estimation or Gaussian processes can be used to estimate the state visitation distributions required for importance sampling in continuous spaces.
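As one speculative sketch of how this could look in practice, the snippet below generalizes the paper's state-action-value test to continuous states by treating a state as negligible when a fitted Q-function assigns nearly the same value to every action there. The `q_model` object, its `predict(state)` method, the policy `prob` methods, and the threshold `eps` are all hypothetical.

```python
import numpy as np

# Speculative sketch: a continuous-state analogue of the state-action-value test
# for negligible states. q_model.predict(state) is assumed to return one value
# per action; the policies expose prob(action, state).

def is_negligible(state, q_model, eps=1e-3):
    """Treat a state as negligible if the action choice barely changes its value,
    i.e., the spread of the approximate Q-values is below the threshold eps."""
    q_values = np.asarray(q_model.predict(state))
    return float(q_values.max() - q_values.min()) < eps

def state_based_weight_continuous(trajectory, pi_e, pi_b, q_model, eps=1e-3):
    """SIS-style weight that skips the ratios of approximately negligible states."""
    w = 1.0
    for state, action in trajectory:
        if is_negligible(state, q_model, eps):
            continue
        w *= pi_e.prob(action, state) / pi_b.prob(action, state)
    return w
```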

What are the theoretical guarantees on the bias and variance of state-based importance sampling estimators under different assumptions about the MDP and the negligible state set?

Theoretical guarantees on the bias and variance of state-based importance sampling estimators depend on the assumptions made about the Markov decision process (MDP) and the negligible state set. When the negligible state set is identified accurately, the bias of the estimator is bounded by the covariance between the dropped portion of the importance weight and the weighted return, so accurate identification keeps the bias small and leads to a lower mean squared error (MSE). The variance of the estimator is also reduced relative to traditional importance sampling, especially when the maximal number of steps spent in non-negligible states is significantly smaller than the horizon. If the state set is identified inaccurately, however, both the bias and the variance of the estimator may increase, leading to a higher MSE.
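A sketch of why the covariance governs the bias, under the assumptions that the full importance weight factors as w = w_A · w_B over dropped and kept states and that a product of per-step action probability ratios has expectation one under the behavior policy:

```latex
% Write the full weight as w = w_A \cdot w_B, where w_A collects the ratios of
% dropped states (S_A) and w_B those of kept states (S_B). Using E[w_A] = 1
% under the behavior policy and the unbiasedness of ordinary IS,
\[
\operatorname{Bias}\big(\hat{G}_{\mathrm{SIS}}\big)
  = \mathbb{E}[w_B G] - \mathbb{E}[w_A w_B G]
  = \mathbb{E}[w_A]\,\mathbb{E}[w_B G] - \mathbb{E}[w_A w_B G]
  = -\operatorname{Cov}\big(w_A,\, w_B G\big).
\]
% So if the dropped ratios are (nearly) uncorrelated with the weighted return,
% which is the idea behind the covariance test for an epsilon-negligible set,
% the bias is (nearly) zero, while the kept product has at most M_B factors,
% giving the O(exp(M_B)) variance bound.
```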

Can state-based importance sampling be combined with other variance reduction techniques, such as control variates or multi-level Monte Carlo, to further improve the efficiency of off-policy evaluation?

State-based importance sampling can be combined with other variance reduction techniques, such as control variates or multi-level Monte Carlo, to further improve the efficiency of off-policy evaluation. Control variates reduce variance by introducing a correlated quantity with known expectation, such as an approximate value function, that corrects the estimate of the expected return; the doubly robust state-based variant (WDRSIS) studied in the paper is already a combination of this kind. Similarly, multi-level Monte Carlo techniques could be layered on top of state-based importance sampling to reduce the computational cost of estimating the expected return by combining samples at different levels of accuracy. Together, these combinations can lead to a more efficient and accurate off-policy evaluation process.
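As a minimal illustration of the control-variate direction (distinct from the paper's doubly robust variant), the sketch below applies a classical control-variate correction to per-trajectory SIS weights, exploiting the fact that the weight itself has expectation one under the behavior policy; the correction coefficient is estimated from the samples, and the numbers in the usage comment are made up.

```python
import numpy as np

# Illustrative sketch (not the paper's estimator): a classical control-variate
# correction on top of per-trajectory SIS weights. The centered weight (w - 1)
# is a valid control variate because a product of per-step action probability
# ratios has expectation 1 under the behavior policy.

def sis_with_control_variate(weights, returns):
    """Variance-reduced estimate from per-trajectory SIS weights and returns."""
    w = np.asarray(weights, dtype=float)
    g = np.asarray(returns, dtype=float)
    wg = w * g                          # plain SIS estimate terms
    cov = np.cov(wg, w)                 # 2x2 sample covariance matrix
    c = cov[0, 1] / cov[1, 1]           # variance-minimizing coefficient estimate
    return float(np.mean(wg - c * (w - 1.0)))

# Example with made-up numbers:
# sis_with_control_variate(weights=[0.8, 1.3, 0.9], returns=[10.0, 12.0, 9.5])
```

The same correction applies unchanged to ordinary importance weights; using the lower-variance SIS weights simply gives the control variate less residual variance to remove.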