Core Concepts
A novel class of Policy Mirror Descent (PMD) algorithms with lookahead, offering improved convergence rates and scalability.
Abstract
The paper presents Policy Mirror Descent (PMD) as a versatile algorithmic framework for reinforcement learning and enhances it with multi-step greedy policy improvement, leading to faster convergence rates. The proposed h-PMD algorithm is detailed, along with its extensions to the inexact setting and to linear function approximation. Convergence analyses, sample complexities, simulations, and related work are covered.
Introduction
PMD as a general algorithmic framework for RL, unifying a broad family of policy gradient methods.
Incorporation of multi-step (h-step) greedy policy improvement to enhance PMD; see the sketch below.
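For context, the exact one-step PMD update performs a mirror-descent step on each state's action distribution. A sketch in common notation, with step size η_k and Bregman divergence D (the paper's notation may differ):

    \pi_{k+1}(\cdot \mid s) \in \operatorname*{arg\,max}_{p \in \Delta(\mathcal{A})} \Big\{ \eta_k \, \langle Q^{\pi_k}(s, \cdot),\, p \rangle \;-\; D\big(p,\, \pi_k(\cdot \mid s)\big) \Big\}

Multi-step greedy improvement replaces the one-step action values Q^{π_k} with h-step lookahead values, as developed in the sections below.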
Policy Iteration with Lookahead
h-PI generalizes policy iteration (PI) by replacing the one-step greedy improvement step with an h-step greedy one.
Monotonic policy improvement is guaranteed for h-greedy policies.
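A minimal tabular sketch of the h-greedy step, assuming known transitions P (an S x A x S array) and rewards r (an S x A array); the names and the tabular setting are illustrative, not the paper's code:

    import numpy as np

    def h_greedy_policy(P, r, gamma, V, h):
        """Return an h-greedy policy w.r.t. the value estimate V.

        Applies the Bellman optimality operator T to V (h - 1) times,
        then acts greedily on the resulting h-step lookahead Q-values.
        With h = 1 this reduces to the standard greedy step of PI.
        """
        for _ in range(h - 1):
            V = (r + gamma * P @ V).max(axis=1)  # V <- T V
        Q_h = r + gamma * P @ V                  # h-step lookahead action values
        return Q_h.argmax(axis=1)                # deterministic h-greedy policy

As h grows, each improvement step looks further ahead, trading extra per-step computation for fewer iterations.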
Policy Mirror Descent with Lookahead
Introduction of the h-PMD algorithm, which runs the PMD update on h-step lookahead action values.
Improved γ^h-linear convergence rate, faster than the γ-linear rate of standard PMD (recovered at h = 1).
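Concretely, h-PMD swaps the one-step action values in the PMD update for h-step lookahead values, with T the Bellman optimality operator; a sketch, deferring step-size conditions and constants to the paper:

    Q_h^{\pi_k} = r + \gamma P\, T^{h-1} V^{\pi_k}, \qquad \pi_{k+1}(\cdot \mid s) \in \operatorname*{arg\,max}_{p \in \Delta(\mathcal{A})} \Big\{ \eta_k \, \langle Q_h^{\pi_k}(s, \cdot),\, p \rangle - D\big(p,\, \pi_k(\cdot \mid s)\big) \Big\}

For suitable step sizes the suboptimality gap then contracts by a factor of γ^h per iteration, i.e. a bound of the form \|V^* - V^{\pi_k}\|_\infty \lesssim \gamma^{hk}.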
Inexact Policy Mirror Descent
Estimation of lookahead action values using Monte Carlo sampling.
Sample complexity analysis for inexact h-PMD.
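One way to form these estimates, assuming access to a generative model sample_next(s, a) that draws a single next state; the helper names, integer state/action encoding, and per-level sample allocation are assumptions for illustration:

    import numpy as np

    def mc_lookahead_q(sample_next, r, n_states, n_actions, gamma, V, h, m):
        """Monte Carlo estimate of the h-step lookahead action values.

        Each of the h backup levels approximates the expectation over
        next states with m i.i.d. draws per (s, a); intermediate levels
        apply an (approximate) Bellman optimality backup.
        """
        V = np.asarray(V, dtype=float)
        Q = np.empty((n_states, n_actions))
        for level in range(h):
            for s in range(n_states):
                for a in range(n_actions):
                    nxt = [sample_next(s, a) for _ in range(m)]
                    Q[s, a] = r[s, a] + gamma * V[nxt].mean()
            if level < h - 1:
                V = Q.max(axis=1)  # approximate optimality backup T V
        return Q                   # noisy estimate of the lookahead values

The sample complexity analysis quantifies how many such samples suffice for inexact h-PMD to converge to near-optimality.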
Function Approximation
Extension of h-PMD to linear function approximation, for state spaces too large for tabular updates.
Convergence analysis for h-PMD with linear function approximation, with guarantees scaling in the feature dimension rather than the state space size.
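Here the lookahead values are fit with a d-dimensional linear model instead of being stored per state. A minimal sketch of such a fit, assuming a feature matrix Phi for n sampled state-action pairs and Monte Carlo targets q_hat (the ridge term and names are illustrative assumptions):

    import numpy as np

    def fit_lookahead_weights(Phi, q_hat, reg=1e-6):
        """Least-squares fit of theta so that Q_h(s, a) ~ phi(s, a) @ theta.

        Phi: (n, d) features of the sampled state-action pairs.
        q_hat: (n,) Monte Carlo estimates of the lookahead values.
        A small ridge term keeps the normal equations well conditioned.
        """
        d = Phi.shape[1]
        return np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ q_hat)

Since only theta in R^d is estimated, the sample complexity can depend on d rather than on the number of states, matching the quoted claim below.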
Simulations
Simulations investigate the effect of the lookahead depth h on the convergence rate; a toy version is sketched below.
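A self-contained toy experiment in this spirit (a sketch only: random tabular MDP, exact policy evaluation, and a KL-divergence PMD update; the paper's experimental setup may differ):

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, gamma, eta = 20, 5, 0.9, 10.0

    P = rng.dirichlet(np.ones(S), size=(S, A))    # (S, A, S) random transitions
    r = rng.uniform(size=(S, A))                  # random rewards

    def policy_value(pi):
        """Exact evaluation: solve (I - gamma * P_pi) V = r_pi."""
        P_pi = np.einsum('sa,sap->sp', pi, P)
        r_pi = (pi * r).sum(axis=1)
        return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

    def lookahead_q(V, h):
        for _ in range(h - 1):
            V = (r + gamma * P @ V).max(axis=1)   # Bellman optimality backup
        return r + gamma * P @ V                  # h-step lookahead Q-values

    V_star = np.zeros(S)                          # V* via value iteration
    for _ in range(2000):
        V_star = (r + gamma * P @ V_star).max(axis=1)

    for h in (1, 2, 4):                           # lookahead depths to compare
        pi = np.full((S, A), 1.0 / A)             # uniform initial policy
        for _ in range(30):
            Q = lookahead_q(policy_value(pi), h)
            Q -= Q.max(axis=1, keepdims=True)     # for numerical stability
            pi = pi * np.exp(eta * Q)             # KL-PMD multiplicative update
            pi /= pi.sum(axis=1, keepdims=True)
        print(f"h={h}: optimality gap {np.abs(V_star - policy_value(pi)).max():.2e}")

Larger h should close the optimality gap in fewer iterations, mirroring the γ^h rate, at the cost of deeper backups per update.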
Related Work
Comparison with the existing literature on policy gradient (PG) methods and tree-search methods.
Quotes
"We propose a novel class of algorithms called h-PMD enhancing PMD with multi-step greedy policy updates."
"We propose an inexact version of h-PMD where lookahead action values are estimated."
"Our resulting sampling complexity only involves dependence on the dimension of the feature map space instead of the state space size."
"Our experiments illustrate the convergence rate improvement of h-PMD with increasing lookahead depth."