The authors study policy gradient methods for finite-horizon, continuous-time exploratory linear-quadratic control problems, proposing geometry-aware gradient descent schemes and proving their robust global linear convergence.
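The core idea of a geometry-aware step can be illustrated on a much simpler proxy. Below is a minimal sketch of policy gradient descent on a one-dimensional, discrete-time LQR problem, where the finite-difference gradient is preconditioned by the accumulated state energy (a natural-gradient-style step in the spirit of Kakade and Fazel et al.). This is not the authors' continuous-time exploratory algorithm; all constants (`a`, `b`, `q`, `r`, `T`, `x0`) are illustrative assumptions.

```python
import numpy as np

# Hypothetical 1-D discrete-time LQR proxy: x_{t+1} = a*x_t + b*u_t,
# cost sum_t (q*x_t^2 + r*u_t^2), linear feedback policy u_t = -k*x_t.
a, b, q, r, T, x0 = 0.9, 0.5, 1.0, 0.1, 50, 1.0

def cost(k):
    """Finite-horizon quadratic cost of the feedback gain k."""
    x, J = x0, 0.0
    for _ in range(T):
        u = -k * x
        J += q * x**2 + r * u**2
        x = a * x + b * u
    return J + q * x**2  # terminal cost

def state_energy(k):
    """sum_t x_t^2 under gain k: the state-covariance term used as a preconditioner."""
    x, s = x0, 0.0
    for _ in range(T):
        s += x**2
        x = (a - b * k) * x
    return s

k, eta, eps = 0.0, 0.05, 1e-5
for _ in range(200):
    g = (cost(k + eps) - cost(k - eps)) / (2 * eps)  # finite-difference gradient
    k -= eta * g / state_energy(k)                   # geometry-aware (preconditioned) step

print(f"learned gain k = {k:.4f}, cost = {cost(k):.4f}")
```

Dividing the raw gradient by the state energy is what makes the step "geometry-aware" here: it rescales the update by how much the current policy actually excites the state, which is the mechanism behind the linear convergence rates known for this family of methods.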
The authors present the first finite-time global convergence analysis of policy gradient methods for average-reward Markov decision processes, proving convergence with sublinear regret; their primary contribution is this finite-time performance guarantee.
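The analysis itself is theoretical, but the objective it studies can be made concrete. Below is a minimal sketch of gradient ascent on the exact long-run average reward of a softmax policy in a small tabular MDP, with the stationary distribution computed in closed form. The two-state MDP, its numbers, and the finite-difference gradients are illustrative assumptions, not the paper's algorithm or its regret analysis.

```python
import numpy as np

# Hypothetical 2-state, 2-action average-reward MDP (illustrative numbers).
# P[a, s, s'] : transition probabilities; R[s, a] : rewards.
P = np.array([[[0.9, 0.1],   # action 0
               [0.2, 0.8]],
              [[0.3, 0.7],   # action 1
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def avg_reward(theta):
    """Exact long-run average reward of the softmax policy with logits theta[s, a]."""
    pi = np.exp(theta)
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,asd->sd', pi, P)   # induced Markov chain
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected per-state reward
    # stationary distribution d solves d P_pi = d, sum(d) = 1
    A = np.vstack([P_pi.T - np.eye(2), np.ones(2)])
    bvec = np.array([0.0, 0.0, 1.0])
    d, *_ = np.linalg.lstsq(A, bvec, rcond=None)
    return d @ r_pi

theta, eta, eps = np.zeros((2, 2)), 0.5, 1e-5
for _ in range(300):
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):     # finite-difference policy gradient
        e = np.zeros_like(theta); e[idx] = eps
        g[idx] = (avg_reward(theta + e) - avg_reward(theta - e)) / (2 * eps)
    theta += eta * g                         # ascend the average reward

print(f"average reward after training: {avg_reward(theta):.4f}")
```

In this average-reward setting the objective is the gain of the induced chain rather than a discounted sum, which is exactly the quantity whose improvement per iteration a finite-time analysis must control.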