The paper investigates the sample complexity required to guarantee a prescribed estimation error for the best linear coefficients in policy evaluation for discounted infinite-horizon Markov decision processes.
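For context, the standard policy-evaluation setup with linear function approximation reads as follows; the notation here is chosen for illustration and is not taken verbatim from the paper.

```latex
% Standard policy-evaluation setup with linear function approximation;
% the notation below is assumed for illustration.
\[
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right],
\qquad
V^{\pi}(s) \;\approx\; \phi(s)^{\top}\theta^{\star},
\]
where $\phi \colon \mathcal{S} \to \mathbb{R}^{d}$ is the feature map, $\gamma \in (0,1)$ the
discount factor, and $\theta^{\star}$ the best linear coefficient vector (the fixed point of
the projected Bellman equation) that the algorithms aim to estimate.
```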
In the on-policy setting, where observations are generated by the target policy, the authors establish the first sample complexity bound with a high-probability convergence guarantee for the temporal difference (TD) learning algorithm. The bound matches the minimax-optimal dependence on the tolerance level and makes the dependence on problem-related quantities explicit, including the choice of the feature map and the problem dimension.
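To make the on-policy algorithm concrete, here is a minimal sketch of linear TD(0) with a constant stepsize, the style of iteration the bound concerns; the helpers env_step and phi are hypothetical, and the paper's exact variant (e.g., whether iterates are averaged) may differ.

```python
def linear_td0(env_step, phi, theta0, gamma, alpha, num_steps, s0):
    """Minimal sketch of on-policy linear TD(0) with a constant stepsize.

    env_step(s) is assumed to return (reward, next_state) obtained by
    following the target policy from state s; phi(s) maps a state to a
    d-dimensional feature vector.
    """
    theta, s = theta0.copy(), s0
    for _ in range(num_steps):
        r, s_next = env_step(s)
        # TD error under the current linear value estimate phi(s)^T theta
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        # Stochastic semi-gradient step on theta
        theta = theta + alpha * delta * phi(s)
        s = s_next
    return theta
```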
In the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, the authors establish a high-probability sample complexity bound for the two-timescale linear TD with gradient correction (TDC) algorithm. This is the first bound to achieve a high-probability convergence guarantee with constant (non-varying) stepsizes and without projection steps or batched updates. The bound also makes the dependence on problem-related parameters explicit.
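For reference, the following is a minimal sketch of the standard two-timescale linear TDC update with constant stepsizes and importance ratios; the helpers env_step and phi are hypothetical, and the paper's exact formulation may differ in details such as iterate averaging.

```python
def linear_tdc(env_step, phi, theta0, w0, gamma, alpha, beta, num_steps, s0):
    """Minimal sketch of off-policy linear TDC with constant two-timescale
    stepsizes (alpha for the main iterate, beta for the auxiliary one).

    env_step(s) is assumed to return (reward, next_state, rho), where rho is
    the importance ratio pi(a|s) / mu(a|s) for the action sampled from the
    behavior policy mu.
    """
    theta, w, s = theta0.copy(), w0.copy(), s0
    for _ in range(num_steps):
        r, s_next, rho = env_step(s)
        feat, feat_next = phi(s), phi(s_next)
        # TD error under the current linear value estimate
        delta = r + gamma * feat_next @ theta - feat @ theta
        # Main (slow-timescale) update with the gradient-correction term
        theta = theta + alpha * rho * (delta * feat - gamma * (feat @ w) * feat_next)
        # Auxiliary (fast-timescale) update tracking the projected TD error
        w = w + beta * rho * (delta - feat @ w) * feat
        s = s_next
    return theta, w
```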
The authors also provide minimax lower bounds to assess the tightness of their upper bounds, showing that the performance of the TD and TDC algorithms cannot be improved in the minimax sense beyond a factor of 1/(1-γ), the effective horizon.