Core Concepts
The authors analyze the statistical efficiency of distributional temporal difference (TD) learning algorithms from a non-asymptotic perspective, focusing on non-parametric distributional TD learning (NTD) and categorical distributional TD learning (CTD).
Abstract
This work presents a statistical analysis of distributional temporal difference algorithms, specifically NTD and CTD, giving a detailed treatment of their convergence rates, sample complexities, and supporting theoretical results.
Distributional reinforcement learning (DRL) focuses on return distributions rather than just means.
NTD and CTD are the key methodologies for distributional policy evaluation (see the sketch after this list).
The paper provides non-asymptotic convergence rates for both NTD and CTD.
Sample complexities are analyzed to determine the number of iterations needed to obtain an ε-optimal estimator with high probability.
Theoretical results are presented with detailed proofs and explanations.
Assumptions, propositions, lemmas, and references support the analytical framework.
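To make the categorical update concrete, here is a minimal tabular sketch of a single CTD step, assuming the standard fixed-support categorical parametrization; the function name `ctd_update`, the toy support, and the step size `alpha` are illustrative choices, not taken from the paper.

```python
import numpy as np

def ctd_update(probs, r, s, s_next, gamma, alpha, support):
    """One categorical TD (CTD) step for tabular policy evaluation.

    probs[s] is a categorical estimate of the return distribution at
    state s, supported on the fixed, evenly spaced atoms in `support`.
    (Illustrative sketch, not the paper's code.)
    """
    n_atoms = support.shape[0]
    v_min, v_max = support[0], support[-1]
    dz = support[1] - support[0]

    # Distributional Bellman target: shift/scale next-state atoms, clip to support.
    tz = np.clip(r + gamma * support, v_min, v_max)

    # Project the target back onto the fixed support via linear interpolation.
    b = (tz - v_min) / dz
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    target = np.zeros(n_atoms)
    w = probs[s_next]
    # When tz lands exactly on an atom (lo == hi), give it full mass at lo.
    np.add.at(target, lo, w * np.where(lo == hi, 1.0, hi - b))
    np.add.at(target, hi, w * (b - lo))

    # Stochastic-approximation step toward the projected Bellman target.
    probs = probs.copy()
    probs[s] = (1.0 - alpha) * probs[s] + alpha * target
    return probs

# Illustrative usage on a 5-state toy problem.
support = np.linspace(0.0, 10.0, 51)
probs = np.full((5, 51), 1.0 / 51)   # uniform initial estimates
probs = ctd_update(probs, r=1.0, s=0, s_next=2, gamma=0.9, alpha=0.1, support=support)
```

The projection step is what keeps the iterate inside the fixed categorical family; the final line is the usual stochastic-approximation mixture toward the projected target.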
Stats
In the case of NTD, $\widetilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ iterations are needed to achieve an ε-optimal estimator with high probability, when the error is measured by the $p$-Wasserstein distance.
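For context, the error metric in this bound is the $p$-Wasserstein distance between return distributions. Below is a minimal sketch of computing $W_p$ between two one-dimensional categorical distributions via the quantile-function formula; `wasserstein_p` and its midpoint-grid approximation are illustrative, not from the paper.

```python
import numpy as np

def wasserstein_p(atoms, p1, p2, p=1, n_grid=10_000):
    """p-Wasserstein distance between two categorical distributions on a
    shared 1-D support, using W_p^p = int_0^1 |F1^{-1}(u) - F2^{-1}(u)|^p du,
    approximated by a midpoint Riemann sum over the quantile levels u.
    (Illustrative helper, not the paper's code.)"""
    u = (np.arange(n_grid) + 0.5) / n_grid
    # Generalized inverse CDF: smallest atom whose CDF reaches level u.
    q1 = atoms[np.minimum(np.searchsorted(np.cumsum(p1), u), len(atoms) - 1)]
    q2 = atoms[np.minimum(np.searchsorted(np.cumsum(p2), u), len(atoms) - 1)]
    return np.mean(np.abs(q1 - q2) ** p) ** (1.0 / p)
```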
Under some mild assumptions, $\widetilde{O}\left(\frac{1}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ iterations suffice to ensure that the Kolmogorov–Smirnov distance between the NTD estimator $\hat{\eta}^{\pi}$ and $\eta^{\pi}$ is less than ε with high probability.
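The Kolmogorov–Smirnov distance in this second bound is the sup-norm between CDFs; for two categorical estimates on a shared support it reduces to a one-liner (again an illustrative sketch, not the paper's code).

```python
import numpy as np

def ks_distance(p1, p2):
    """Kolmogorov-Smirnov distance between two categorical distributions
    on the same support: the maximum absolute gap between their CDFs."""
    return float(np.max(np.abs(np.cumsum(p1) - np.cumsum(p2))))
```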