Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm
This paper introduces a novel primal-dual policy gradient algorithm for infinite horizon average reward constrained Markov Decision Processes (CMDPs) with general policy parameterization, achieving sublinear bounds on both regret and constraint violation.
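To make the setting concrete, the following is a minimal sketch of a generic primal-dual policy gradient loop for an average-cost CMDP, not the paper's algorithm: a toy two-action problem, a softmax ("general") parameterization, a crude REINFORCE-style primal update on the Lagrangian, and a projected dual update on the multiplier. All rewards, costs, step sizes, and function names here are illustrative assumptions.

```python
import math
import random

# Hypothetical toy CMDP (assumption): one state, 2 actions; action 1 yields
# more reward but also more cost. The constraint is on the average cost.
REWARD = {0: 0.2, 1: 1.0}
COST = {0: 0.0, 1: 1.0}
BUDGET = 0.5  # constraint: long-run average cost <= BUDGET


def softmax_probs(theta):
    """Softmax policy over the two actions (a simple general parameterization)."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]


def rollout(theta, T, rng):
    """Sample T steps; return average reward, average cost, and score sums."""
    avg_r = avg_c = 0.0
    score = [0.0, 0.0]  # accumulates grad_theta log pi(a) over the rollout
    for _ in range(T):
        p = softmax_probs(theta)
        a = 0 if rng.random() < p[0] else 1
        avg_r += REWARD[a] / T
        avg_c += COST[a] / T
        for i in range(2):  # grad_theta_i log pi(a) = 1{i=a} - p[i]
            score[i] += (1.0 if i == a else 0.0) - p[i]
    return avg_r, avg_c, score


def primal_dual_pg(iters=200, T=200, eta_theta=0.05, eta_lam=0.1, seed=0):
    """Primal-dual loop: ascend the Lagrangian in theta, descend in lambda."""
    rng = random.Random(seed)
    theta, lam = [0.0, 0.0], 0.0
    for _ in range(iters):
        avg_r, avg_c, score = rollout(theta, T, rng)
        # Primal step: REINFORCE-style estimate (no baseline; biased but
        # adequate for a sketch) of grad_theta of L = J_r - lam*(J_c - BUDGET).
        adv = avg_r - lam * (avg_c - BUDGET)
        for i in range(2):
            theta[i] += eta_theta * adv * score[i] / T
        # Dual step: raise lam when the constraint is violated, project to >= 0.
        lam = max(0.0, lam + eta_lam * (avg_c - BUDGET))
    return theta, lam
```

The dual variable `lam` grows while the sampled average cost exceeds the budget, penalizing the costly action in the primal update; the projection `max(0, ...)` keeps the multiplier feasible.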