Core Concepts
This work extends provably correct reinforcement learning algorithms for low-rank Markov Decision Processes (MDPs) to settings with continuous action spaces, without requiring discretization of the action space.
Abstract
The paper discusses the limitations of existing PAC (probably approximately correct) reinforcement learning algorithms for low-rank MDPs, whose sample complexity exhibits a super-linear dependence on the size of the action space |A|. This makes them unsuitable for applications with large or continuous action spaces.
To address this, the authors propose two strategies:
Utilizing smoothness of error functions: The authors generalize the importance sampling lemma used in these analyses so that it applies to α-smooth functions. When the transition-model errors are smooth in the actions, this allows errors to be bounded under any policy, not just under uniform exploration.
Using smoothed policies: The authors show that if the policies being evaluated have a uniformly bounded density ratio (with respect to a base measure over actions), the importance sampling lemma can be applied without the |A| dependence. They further extend this to unrestricted policies when the transition function and reward function are Hölder continuous. (A sketch of both importance sampling steps follows this list.)
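To make the role of the importance sampling lemma concrete, here is a brief sketch (our paraphrase, not the paper's exact statements) of the standard finite-action bound and its bounded-density-ratio variant; the α-smooth generalization replaces the pointwise bound on π with a smoothness argument and is only indicated here:

```latex
% Finite-action importance sampling step (valid since \pi(a \mid x) \le 1 for every a):
\mathbb{E}_{a \sim \pi(\cdot \mid x)}\!\left[f(x,a)\right]
  = \sum_{a \in \mathcal{A}} \pi(a \mid x)\, f(x,a)
  \le |\mathcal{A}| \cdot \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} f(x,a)
  = |\mathcal{A}|\, \mathbb{E}_{a \sim \mathrm{Unif}(\mathcal{A})}\!\left[f(x,a)\right].

% Bounded-density-ratio variant (continuous actions): if \pi(a \mid x) \le K\, \nu(a)
% for a base measure \nu over actions, the same one-line argument gives
\mathbb{E}_{a \sim \pi(\cdot \mid x)}\!\left[f(x,a)\right]
  \le K\, \mathbb{E}_{a \sim \nu}\!\left[f(x,a)\right],
% with no dependence on |\mathcal{A}|; the \alpha-smooth generalization instead
% trades the pointwise bound on \pi against the smoothness of f(x,\cdot) in the action.
```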
As a case study, the authors apply these techniques to the FLAMBE algorithm, a seminal PAC RL algorithm for low-rank MDPs. They provide two PAC bounds for FLAMBE in the continuous-action setting: one for restricted policies with a bounded density ratio, and one for unrestricted policies under additional smoothness assumptions.
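For intuition about what a bounded-density-ratio policy can look like, the following Python sketch is a hypothetical smoothing construction (ours, not necessarily the paper's): the base policy's action is perturbed by uniform noise on the unit cube, which gives an explicit bound K on the density ratio.

```python
import numpy as np

def smooth_policy_action(base_action, radius, rng):
    """Hypothetical smoothing construction (illustrative, not the paper's exact one):
    perturb the base policy's chosen action with uniform noise of half-width
    `radius` in each coordinate, wrapping around the unit cube [0, 1)^m.
    The smoothed action density is at most (1 / (2 * radius))^m everywhere, so
    its ratio to the uniform base measure on [0, 1)^m is bounded by
    K = (1 / (2 * radius))^m -- the bounded-density-ratio condition under which
    the |A|-free importance sampling step applies.
    """
    noise = rng.uniform(-radius, radius, size=base_action.shape)
    return np.mod(base_action + noise, 1.0)

# Usage: a 2-dimensional action in [0, 1)^2, smoothing radius 0.05.
rng = np.random.default_rng(0)
a = smooth_policy_action(np.array([0.3, 0.9]), radius=0.05, rng=rng)
K = (1.0 / (2 * 0.05)) ** 2  # density-ratio bound = 100 for m = 2
```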
The key insights are that leveraging smoothness in the problem parameters can significantly improve the sample complexity scaling with action space size, and that restricting to policies with bounded density ratio provides a simple way to avoid the |A| dependence in certain parts of the analysis.
Stats
The number of trajectories collected by the FLAMBE algorithm in the continuous action setting is:
For restricted policies:
Õ( H^2 · K^(5+4τ) · d^(7+4τ) · L^((9+8τ)κ) / ε^(10+8τ) )
For unrestricted policies:
Õ( H^2 · d^(7+4τ+(4τ+5)σ) · L^((9+8τ)κ) · L^((4τ+5)σ) / ε^(10+8τ+(4τ+5)σ) )
Where:
τ = m/αE, with αE the smoothness order of transition errors
κ = m/(m + αE)
σ = m/min(αT, αR), with αT the smoothness order of true transitions and αR the smoothness order of rewards
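As a sanity check on these expressions, the following Python sketch evaluates the exponents for given smoothness orders; treating (9+8τ)κ and (4τ+5)σ as exponents of L is our reading of the displayed bounds.

```python
def exponents(m, alpha_E, alpha_T, alpha_R):
    """Evaluate the exponents in the two FLAMBE trajectory bounds above
    (H always enters as H^2)."""
    tau = m / alpha_E                   # τ = m / α_E
    kappa = m / (m + alpha_E)           # κ = m / (m + α_E)
    sigma = m / min(alpha_T, alpha_R)   # σ = m / min(α_T, α_R)
    restricted = {
        "K": 5 + 4 * tau,
        "d": 7 + 4 * tau,
        "eps": 10 + 8 * tau,
        "L": (9 + 8 * tau) * kappa,
    }
    unrestricted = {
        "d": 7 + 4 * tau + (4 * tau + 5) * sigma,
        "eps": 10 + 8 * tau + (4 * tau + 5) * sigma,
        "L": (9 + 8 * tau) * kappa + (4 * tau + 5) * sigma,
    }
    return restricted, unrestricted

# Example: 1-dimensional actions with order-1 (Lipschitz) smoothness throughout.
r, u = exponents(m=1, alpha_E=1, alpha_T=1, alpha_R=1)
# r -> {'K': 9.0, 'd': 11.0, 'eps': 18.0, 'L': 8.5}
# u -> {'d': 20.0, 'eps': 27.0, 'L': 17.5}
```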