High-Probability Sample Complexity Bounds for Off-Policy Policy Evaluation with Linear Function Approximation
This paper establishes the first high-probability sample complexity bounds for two widely used policy evaluation algorithms: the temporal difference (TD) learning algorithm in the on-policy setting, and the two-timescale linear TD with gradient correction (TDC) algorithm in the off-policy setting. The bounds match the minimax-optimal dependence on the target accuracy level and make the dependence on problem-related parameters explicit.
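To make the second algorithm concrete, below is a minimal sketch of one iteration of two-timescale TDC with linear function approximation, following the standard update rules from the TDC literature. The function and variable names (`tdc_update`, `phi`, `rho`, etc.) are illustrative choices, not taken from this paper, and the step-size convention is an assumption.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One two-timescale TDC iteration (illustrative sketch, not the paper's code).

    theta    : main weights of the linear value estimate V(s) = theta @ phi(s)
    w        : auxiliary weights tracking the gradient-correction term
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio (target policy / behavior policy)
    gamma    : discount factor
    alpha    : step size for theta; beta is the step size for w
               (the two timescales come from choosing alpha and beta
               to decay at different rates)
    """
    # TD error under the current value estimate
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    # Main update with the gradient-correction term involving w
    theta_new = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)
    # Auxiliary update driving w toward a least-squares fit of delta
    w_new = w + beta * rho * (delta - phi @ w) * phi
    return theta_new, w_new
```

Off-policy data enters only through the ratio `rho`; setting `rho = 1` and `w = 0` recovers the on-policy TD(0) update that the paper's first bound concerns.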