Core Concepts
This article proposes CAESAR, an efficient algorithm for simultaneously evaluating the performance of multiple target reinforcement learning policies. CAESAR computes an approximate optimal offline sampling distribution and uses the data sampled from it to estimate the policy values.
Abstract
The article focuses on the problem of multiple-policy evaluation in reinforcement learning, where the goal is to estimate the expected total rewards of a set of K target policies to a given accuracy with high probability.
The key highlights and insights are:
The authors propose an algorithm called CAESAR that consists of two main phases:
In the first phase, they produce coarse estimates of the visitation distributions of the target policies at a low-order sample complexity cost.
In the second phase, they approximate the optimal offline sampling distribution and compute the importance weighting ratios for all target policies by minimizing a step-wise quadratic loss function.
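The second phase can be viewed as a convex program: choose a sampling distribution mu that minimizes the worst-case variance proxy max over policies k of the sum over state-action pairs of d_k(s,a)^2 / mu(s,a). The toy sketch below illustrates this idea only; the arrays d, the softmax parameterization, and the use of Nelder-Mead are assumptions for illustration, not the paper's actual construction.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy setup: 2 target policies over 4 state-action pairs.
# d[k] holds the (coarsely estimated) visitation distribution of policy k.
d = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7]])

def objective(z):
    # Softmax parameterization keeps mu a valid probability distribution.
    mu = np.exp(z) / np.exp(z).sum()
    # Worst-case variance proxy: max_k sum_i d[k, i]^2 / mu[i]
    return max((dk ** 2 / mu).sum() for dk in d)

# Starting from z = 0 (i.e. uniform mu), search for a better sampling distribution.
res = minimize(objective, x0=np.zeros(4), method="Nelder-Mead")
mu_star = np.exp(res.x) / np.exp(res.x).sum()
print(mu_star)
```

In this symmetric toy example the optimizer shifts sampling mass toward the state-action pairs that some target policy visits heavily, which is exactly what makes the downstream importance-weighted estimates lower variance than uniform sampling.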
CAESAR achieves a sample complexity that scales with the maximum squared visitation probabilities of the target policies under the optimal sampling distribution, rather than the maximum visitation probabilities as in previous work. This leads to significant improvements in sample efficiency.
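The gain from squaring can be made concrete with a schematic comparison; the symbols below (per-step visitation probabilities d and optimal sampling distribution mu*) are illustrative notation, not the paper's exact bounds. Since each visitation probability is at most 1, squaring it can only shrink the corresponding term:

```latex
% Schematic rate comparison (illustrative notation, not the paper's exact bounds):
% d^{\pi_k}_h(s,a): visitation probability of target policy \pi_k at step h,
% \mu^*_h: the optimal offline sampling distribution at step h.
\underbrace{\sum_h \sum_{s,a} \max_k \frac{d^{\pi_k}_h(s,a)}{\mu^*_h(s,a)}}_{\text{prior-work-style rate}}
\;\ge\;
\underbrace{\sum_h \sum_{s,a} \max_k \frac{\big(d^{\pi_k}_h(s,a)\big)^2}{\mu^*_h(s,a)}}_{\text{CAESAR-style rate}},
\qquad \text{since } 0 \le d^{\pi_k}_h(s,a) \le 1 .
```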
The authors leverage techniques like coarse distribution estimation, optimal sampling distribution computation, and step-wise loss function minimization to derive the non-asymptotic sample complexity guarantees for CAESAR.
CAESAR consistently outperforms the naive uniform sampling strategy over target policies, and in some cases also improves upon the previous state-of-the-art results.
The authors also provide additional results on estimating the importance weighting ratios using a novel step-wise loss function, which may be of independent interest beyond the specific multiple-policy evaluation problem.
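One plausible instantiation of a step-wise quadratic loss for ratio estimation is sketched below; the specific loss L(w) = E_mu[w^2] - 2 * sum(d_hat * w) and the toy numbers are assumptions for illustration, not the paper's exact construction. Its population minimizer is w* = d_hat / mu, so minimizing it recovers the importance weighting ratios without dividing by mu explicitly.

```python
import numpy as np

# Toy tabular setup at a single step h (illustrative values).
mu = np.array([0.4, 0.3, 0.2, 0.1])     # sampling distribution over 4 (s, a) pairs
d_hat = np.array([0.1, 0.2, 0.3, 0.4])  # coarse estimate of a target policy's visitation

# Quadratic loss in w: L(w) = sum_i mu[i] * w[i]^2 - 2 * sum_i d_hat[i] * w[i].
# This convex loss is minimized at w*[i] = d_hat[i] / mu[i].
def loss(w):
    return np.sum(mu * w ** 2) - 2.0 * np.sum(d_hat * w)

# Plain gradient descent on the convex loss converges to the ratio.
w = np.zeros(4)
for _ in range(2000):
    grad = 2.0 * mu * w - 2.0 * d_hat
    w -= 0.1 * grad

print(w)           # approaches the importance weighting ratios
print(d_hat / mu)  # exact ratios for comparison
```

The appeal of this formulation is that the loss only requires expectations under the sampling distribution and the coarse visitation estimates, both of which are available after the first phase.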
Stats
The article does not contain explicit numerical data or metrics; its key quantitative results are the sample complexity bounds derived for the CAESAR algorithm.