Core Concepts
This paper proposes OP-TENET, a reinforcement learning algorithm that attains an ε-optimal policy within O(1/ε^2) episodes for a class of partially observable Markov decision processes (POMDPs) with a linear structure. The sample complexity of OP-TENET scales polynomially in the intrinsic dimension of the linear structure and is independent of the sizes of the observation and state spaces.
Abstract
The paper studies reinforcement learning for POMDPs with infinite observation and state spaces, a setting that remains underexplored in theory. The authors propose OP-TENET, an algorithm that achieves sample-efficient reinforcement learning in POMDPs with a linear structure.
Key highlights:
OP-TENET attains an ε-optimal policy within O(1/ε^2) episodes; its sample complexity scales polynomially in the intrinsic dimension of the linear structure and is independent of the sizes of the observation and state spaces.
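In schematic form, and hedging on the exact polynomial factors (the precise bound, including its dependence on quantities such as the horizon, is in the paper), the guarantee reads:

```latex
% Schematic guarantee: after N episodes, OP-TENET outputs a policy
% \widehat{\pi} that is \epsilon-optimal.  Here d is the intrinsic
% dimension of the linear structure and \gamma is the ill-conditioning
% measure discussed below; the shape of the polynomial is an assumption
% of this sketch, not the paper's exact statement.
N \;=\; O\!\left(\frac{\operatorname{poly}(d,\gamma)}{\epsilon^{2}}\right)
\qquad\Longrightarrow\qquad
V^{\widehat{\pi}} \;\ge\; V^{\pi^{\star}} - \epsilon
```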
The sample efficiency of OP-TENET rests on three ingredients: (i) a Bellman operator with finite memory, (ii) the identification and estimation of this operator via an adversarial integral equation with a smoothed discriminator, and (iii) exploration of the observation and state spaces via optimism, driven by quantifying the uncertainty in the adversarial integral equation.
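A minimal numerical sketch of ingredients (ii) and (iii), under strong simplifying assumptions: the linear feature map, the reduction of the adversarial objective to ridge regression, and all names (Phi, Psi, B_hat, lam) are illustrative stand-ins, not the paper's actual construction.

```python
import numpy as np

# Toy setup: a finite-memory Bellman operator is modeled as an unknown
# d x d matrix B acting on a d-dimensional linear feature space.
rng = np.random.default_rng(0)
d, n = 4, 500                                   # intrinsic dimension, sample size
B_true = rng.normal(size=(d, d)) / d            # unknown operator to recover
Phi = rng.normal(size=(n, d))                   # features phi(o_t) of observations
Psi = Phi @ B_true.T + 0.1 * rng.normal(size=(n, d))  # noisy next-step features

# (ii) Adversarial integral equation: with linear discriminators
# f(o) = <w, phi(o)>, ||w|| <= 1 (a bounded, "smoothed" class), the
# adversarial residual sup_w |E[<w, psi - B phi>]| is the norm of the
# expected residual.  As a simple stand-in for the paper's minimax
# estimator, we drive the residuals to zero with ridge regression:
lam = 1.0
B_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Psi).T

# (iii) Optimism: the inverse regularized covariance defines an elliptical
# confidence set around B_hat; its width along a feature direction phi
# serves as an exploration bonus for under-explored directions.
cov_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d))
phi = rng.normal(size=d)                        # a query feature direction
bonus = float(np.sqrt(phi @ cov_inv @ phi))

print("operator estimation error:", np.linalg.norm(B_hat - B_true))
print("optimism bonus along phi:", bonus)
```

As in optimistic algorithms for linear MDPs, the bonus shrinks along directions where data accumulate, so optimism automatically steers exploration toward poorly covered parts of the feature space.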
The authors define a class of POMDPs with a linear structure and identify an ill-conditioning measure, the operator norm of the bridge operator, that quantifies the fundamental difficulty of reinforcement learning in such POMDPs.
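For reference, the operator norm in question is the standard one; the precise definition of the bridge operator 𝔹 itself is in the paper:

```latex
% Standard operator norm of the bridge operator \mathbb{B}; a larger
% \gamma corresponds to a more ill-conditioned POMDP and a larger
% sample-complexity bound.
\gamma \;=\; \|\mathbb{B}\|_{\mathrm{op}}
\;=\; \sup_{f \neq 0} \frac{\|\mathbb{B} f\|}{\|f\|}
```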
The theoretical analysis shows that the sample complexity of OP-TENET depends polynomially on this ill-conditioning measure, a dependence that has no counterpart in sample complexity results for fully observable MDPs.
Stats
The paper contains no experiments or explicit numerical data; it focuses on the theoretical analysis of the proposed OP-TENET algorithm.