High-Probability Sample Complexity Bounds for Off-Policy Policy Evaluation with Linear Function Approximation
This paper establishes the first high-probability sample complexity bounds for two widely used policy evaluation algorithms: the temporal difference (TD) learning algorithm in the on-policy setting, and the two-timescale linear TD with gradient correction (TDC) algorithm in the off-policy setting. The bounds match the minimax-optimal dependence on the target accuracy level and make the dependence on problem-related parameters explicit.
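To make the second algorithm concrete, below is a minimal sketch of one iteration of two-timescale TDC with linear function approximation, following the standard update rules from the TDC literature. The function and variable names (`tdc_update`, `phi`, `rho`, etc.) are illustrative choices, not taken from this paper, and the step-size convention is an assumption.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One two-timescale TDC iteration (illustrative sketch, not the paper's code).

    theta    : main weights of the linear value estimate V(s) = theta @ phi(s)
    w        : auxiliary weights tracking the gradient-correction term
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio (target policy / behavior policy)
    gamma    : discount factor
    alpha    : step size for theta; beta is the step size for w
               (the two timescales come from choosing alpha and beta
               to decay at different rates)
    """
    # TD error under the current value estimate
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    # Main update with the gradient-correction term involving w
    theta_new = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)
    # Auxiliary update driving w toward a least-squares fit of delta
    w_new = w + beta * rho * (delta - phi @ w) * phi
    return theta_new, w_new
```

Off-policy data enters only through the ratio `rho`; setting `rho = 1` and `w = 0` recovers the on-policy TD(0) update that the paper's first bound concerns.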