Reinforcement Learning in Partially Observable Markov Decision Processes with Provable Sample Efficiency
This paper proposes OP-TENET, a reinforcement learning algorithm that attains an ε-optimal policy within O(1/ε²) episodes for a class of partially observable Markov decision processes (POMDPs) with linear structure. The sample complexity of OP-TENET scales polynomially in the intrinsic dimension of the linear structure and is independent of the sizes of the observation and state spaces.
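The guarantee in the abstract can be written formally as follows; this is a hedged sketch, where the symbols ε (target accuracy), d (intrinsic dimension of the linear structure), V^π (expected return of policy π), and the polylogarithmic factors hidden in the tilde are assumptions of notation, not taken verbatim from the paper:

```latex
% A \hat\pi returned by the algorithm is \epsilon-optimal:
%   V^{\hat\pi} \ge \max_{\pi} V^{\pi} - \epsilon,
% after a number of episodes
%   N = \widetilde{O}\!\left( \frac{\mathrm{poly}(d)}{\epsilon^{2}} \right),
% which depends on the intrinsic dimension d but not on |S| or |O|.
```

Here the 1/ε² dependence matches the O(1/ε²) episode bound stated above, and the poly(d) factor encodes the claimed polynomial scaling in the intrinsic dimension.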