Core Concepts
We present two policy gradient-based algorithms that achieve improved regret bounds over the existing state of the art for infinite horizon average reward Markov Decision Processes with general policy parameterization.
Abstract
The paper presents two policy gradient-based algorithms for solving infinite horizon average reward Markov Decision Processes (MDPs) with general policy parameterization.
Key highlights:
The first algorithm employs implicit gradient transport for variance reduction, achieving a regret bound of order ~O(T^{3/5}).
The second algorithm utilizes a Hessian-based technique, attaining a regret bound of order ~O(√T), which is order-optimal up to logarithmic factors.
These results significantly improve upon the existing state-of-the-art regret bound of ~O(T^{3/4}).
The authors address the technical challenges of adapting variance reduction techniques from the discounted reward setting to the average reward setting.
The algorithms require sampling only a single trajectory per iteration and have memory and computational complexity similar to that of Hessian-free methods.
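To make the single-trajectory requirement concrete, here is a minimal, hypothetical sketch of a REINFORCE-style estimator of the average-reward policy gradient computed from one trajectory, using the empirical average reward as a baseline and a truncated advantage window. The toy MDP, the tabular softmax parameterization, and the window length N are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, N = 4, 3, 2000, 50            # states, actions, trajectory length, advantage window

# A toy ergodic MDP, only to make the sketch runnable.
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                 # rewards in [0, 1]

def softmax_policy(theta, s):
    """pi_theta(. | s) for a tabular softmax parameterization (illustrative only)."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def single_trajectory_gradient(theta):
    """REINFORCE-style estimate of the average-reward gradient from ONE trajectory.

    The empirical average reward serves as a baseline, and the advantage is
    truncated to an N-step window (both are simplifications of the paper's estimator).
    """
    states, actions, rewards = [], [], []
    s = rng.integers(S)
    for _ in range(H):
        p = softmax_policy(theta, s)
        a = rng.choice(A, p=p)
        states.append(s)
        actions.append(a)
        rewards.append(R[s, a])
        s = rng.choice(S, p=P[s, a])
    rewards = np.array(rewards)
    J_hat = rewards.mean()                   # estimate of the average reward J(theta)

    grad = np.zeros_like(theta)
    for t in range(H - N):
        adv = np.sum(rewards[t:t + N] - J_hat)        # truncated advantage estimate
        score = -softmax_policy(theta, states[t])     # grad of log softmax ...
        score[actions[t]] += 1.0                      # ... is one_hot(a) - pi(.|s)
        grad[states[t]] += adv * score
    return grad / (H - N), J_hat

theta = np.zeros((S, A))
g_hat, J_hat = single_trajectory_gradient(theta)      # one iteration's worth of data
```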
The paper first provides an overview of related work in policy gradient methods for discounted reward MDPs and model-based/tabular approaches for average reward MDPs. It then introduces the problem setup and key assumptions.
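For reference, the average-reward objective and the notion of regret used in this setting are typically defined as follows (stated schematically; the paper's exact definitions and conditions may differ in minor details):

```latex
% Average-reward objective of a parameterized policy \pi_\theta
J(\theta) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t) \,\middle|\, \pi_\theta \right],
\qquad
J^{*} \;=\; \max_{\pi} J^{\pi} \quad \text{(optimal average reward)}.

% Regret accumulated over T interaction steps
\mathrm{Reg}_T \;=\; T\,J^{*} \;-\; \sum_{t=0}^{T-1} r(s_t, a_t)
\;=\; \sum_{t=0}^{T-1} \bigl( J^{*} - r(s_t, a_t) \bigr).
```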
The two proposed algorithms are then described in detail:
Parameterized Policy Gradient with Implicit Gradient Transport:
Utilizes implicit gradient transport for variance reduction without requiring importance sampling or curvature information.
Achieves a regret bound of order ~O(T^{3/5}).
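A minimal sketch of the implicit gradient transport idea, shown on a toy stochastic objective rather than the paper's policy-gradient estimator: the stochastic gradient is queried at an extrapolated point so that the momentum average of past gradients stays aligned with the current iterate, without importance sampling or curvature information. The quadratic objective, constant mixing weight gamma, and step size eta below are placeholder assumptions, not the paper's choices or schedules.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 500
theta_star = rng.normal(size=d)

def noisy_grad(theta):
    """Noisy gradient of a toy concave objective -0.5 * ||theta - theta_star||^2.

    Stands in for the single-trajectory policy-gradient estimate.
    """
    return (theta_star - theta) + 0.1 * rng.normal(size=d)

eta, gamma = 0.1, 0.9        # step size and mixing weight (placeholders)
theta_prev = np.zeros(d)
theta = np.zeros(d)
momentum = np.zeros(d)

for t in range(T):
    # Query the stochastic gradient at an extrapolated point; this "transports"
    # the momentum of past gradients to the current iterate without importance
    # sampling.
    extrap = theta + (gamma / (1.0 - gamma)) * (theta - theta_prev)
    g = noisy_grad(extrap)
    momentum = gamma * momentum + (1.0 - gamma) * g
    theta_prev, theta = theta, theta + eta * momentum   # gradient ASCENT (reward maximization)

print(np.linalg.norm(theta - theta_star))   # should end up close to zero
```

In the actual algorithm, noisy_grad would be replaced by a single-trajectory policy-gradient estimate and the mixing weight would follow the paper's schedule.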
Parameterized Hessian-aided Policy Gradient:
Incorporates second-order information via Hessian estimates.
Attains a regret bound of order ~O(√T), which is order-optimal up to logarithmic factors.
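A minimal sketch of the Hessian-aided idea on the same kind of toy objective: the change in the gradient between consecutive iterates equals an integral of Hessian-vector products along the segment connecting them, so sampling a point uniformly on that segment gives an unbiased correction, and the running gradient estimate can be updated using Hessian-vector products only. The toy objective, noise levels, refresh period K, and step size are placeholder assumptions; the paper's estimator and schedules differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, K = 5, 500, 10                  # dimension, iterations, gradient-refresh period (placeholders)
theta_star = rng.normal(size=d)

def noisy_grad(theta):
    """Noisy gradient of the toy concave objective -0.5 * ||theta - theta_star||^2."""
    return (theta_star - theta) + 0.05 * rng.normal(size=d)

def noisy_hvp(theta, v):
    """Noisy Hessian-vector product of the same toy objective (its Hessian is -I)."""
    return -v + 0.05 * rng.normal(size=d)

eta = 0.1                             # step size (placeholder)
theta_prev = np.zeros(d)
theta = np.zeros(d)
v = np.zeros(d)

for t in range(T):
    if t % K == 0:
        v = noisy_grad(theta)         # occasional direct gradient estimate
    else:
        # grad f(theta) - grad f(theta_prev) is an integral of Hessian-vector
        # products along the segment; a uniformly sampled point gives an
        # unbiased correction to the running estimate.
        a = rng.uniform()
        point = theta_prev + a * (theta - theta_prev)
        v = v + noisy_hvp(point, theta - theta_prev)
    theta_prev, theta = theta, theta + eta * v   # gradient ASCENT step

# The error should be much smaller than the scale of theta_star itself.
print(np.linalg.norm(theta - theta_star), np.linalg.norm(theta_star))
```

Because only Hessian-vector products are ever formed, the memory and per-iteration compute stay comparable to Hessian-free methods, as noted in the highlights above.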
The paper provides a detailed proof outline, establishing key lemmas on the properties of the gradient and Hessian estimators, as well as bounds on the expected regret for both algorithms.
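Schematically, regret analyses in this line of work split the regret into an optimization term and an estimation/mixing term, and then control the optimization term through the gradient norms of the iterates plus a policy-class bias term; sharper gradient estimates from variance reduction shrink the gradient-norm sum, which is what drives the improved ~O(T^{3/5}) and ~O(√T) rates. A schematic version is shown below (exact statements, constants, and conditions as in the paper):

```latex
% Schematic regret decomposition
\mathbb{E}\bigl[\mathrm{Reg}_T\bigr]
 \;=\; \underbrace{\sum_{t=0}^{T-1} \mathbb{E}\bigl[J^{*} - J(\theta_t)\bigr]}_{\text{optimization error}}
 \;+\; \underbrace{\sum_{t=0}^{T-1} \mathbb{E}\bigl[J(\theta_t) - r(s_t, a_t)\bigr]}_{\text{estimation / mixing error}},

% with the optimization error typically related to the gradient norms of the
% iterates plus a policy-class bias term, schematically
\sum_{t=0}^{T-1} \mathbb{E}\bigl[J^{*} - J(\theta_t)\bigr]
 \;\lesssim\; \sum_{t=0}^{T-1} \mathbb{E}\bigl[\|\nabla_\theta J(\theta_t)\|\bigr]
 \;+\; T\sqrt{\varepsilon_{\mathrm{bias}}}.
```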