Core Concepts
Exploration in reinforcement learning is computationally harder than prediction in supervised learning, under a plausible cryptographic hardness assumption.
Abstract
The paper investigates the computational complexity of reinforcement learning (RL) compared to supervised learning (regression). It focuses on a class of Markov decision processes (MDPs) called block MDPs, where each observed state is a stochastic emission from a smaller latent state space and decodes to a unique latent state (the "block" structure).
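To make this structure concrete, here is a minimal toy sketch of a block MDP's generative process. The names, sizes, and dynamics are illustrative assumptions, not the paper's construction; what matters is that the agent only ever sees emissions, never the latent state.

```python
# A toy block MDP generative process (hypothetical dynamics).
import random

NUM_LATENT = 3          # small latent state space
OBS_PER_LATENT = 5      # each latent state emits from its own disjoint set
ACTIONS = [0, 1]

def transition(latent: int, action: int) -> int:
    """Latent dynamics: next latent state given current state and action."""
    return (latent + action + 1) % NUM_LATENT  # toy deterministic dynamics

def emit(latent: int) -> int:
    """Stochastic emission. Disjoint observation supports per latent state
    give the 'block' property: every observation decodes to a unique
    latent state."""
    return latent * OBS_PER_LATENT + random.randrange(OBS_PER_LATENT)

def rollout(horizon: int = 4):
    latent = 0
    trajectory = []
    for _ in range(horizon):
        obs = emit(latent)               # the agent sees obs, never latent
        action = random.choice(ACTIONS)  # placeholder policy
        trajectory.append((obs, action))
        latent = transition(latent, action)
    return trajectory

print(rollout())
```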
The key insights are:
The paper constructs a family of block MDPs in which reward-free RL (learning to reach every latent state, with no reward signal) is computationally harder than realizable regression (predicting labels from covariates when the true function lies in a known class), under a cryptographic hardness assumption.
The paper also shows that even in reward-directed RL, the natural regression oracle does not suffice for computationally efficient learning, so a strictly stronger oracle is necessary (an interface for this oracle is sketched below).
The technical proofs involve novel reductions between RL in block MDPs and variants of the Learning Parities with Noise (LPN) problem, a well-studied cryptographic hardness assumption (a sample-generation sketch appears below). This includes showing that LPN remains hard even when the noise bits are weakly dependent rather than independent.
The results suggest that exploration, a core challenge in RL, is computationally harder than prediction, the focus of supervised learning. This yields a complexity-theoretic separation between the two modes of learning.
The paper also discusses special cases of block MDPs where RL is tractable given access to a regression oracle, underscoring the importance of identifying the structural assumptions that make RL computationally easier.
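For concreteness, the realizable-regression oracle contrasted against RL can be thought of as the following interface. This is a toy empirical-risk-minimization sketch over a finite function class, with hypothetical names; the paper treats the oracle abstractly.

```python
# A toy realizable-regression oracle (hypothetical interface).
from typing import Callable, Sequence, Tuple

def regression_oracle(
    data: Sequence[Tuple[int, float]],
    function_class: Sequence[Callable[[int], float]],
) -> Callable[[int], float]:
    """Return the function in the class with the smallest squared error on
    the data; under realizability, some function fits the labels exactly."""
    def sq_err(f: Callable[[int], float]) -> float:
        return sum((f(x) - y) ** 2 for x, y in data)
    return min(function_class, key=sq_err)

# Usage: labels are realizable by f(x) = 2x, so the oracle recovers it.
f_class = [lambda x: 0.0, lambda x: float(x), lambda x: 2.0 * x]
data = [(1, 2.0), (3, 6.0)]
f_hat = regression_oracle(data, f_class)
print(f_hat(5))  # 10.0
```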
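The LPN assumption underlying the reductions can be illustrated with its standard sample distribution, sketched below. The parameters are illustrative; the paper's variants modify how the noise bits are drawn (e.g., allowing weak dependence rather than independence).

```python
# Standard LPN sample distribution (illustrative parameters).
import random

def lpn_sample(secret: list, noise_rate: float):
    """One LPN sample: (a, <a, s> + e mod 2), where a is a uniformly random
    binary vector and e is Bernoulli(noise_rate) noise."""
    a = [random.randrange(2) for _ in secret]
    parity = sum(ai * si for ai, si in zip(a, secret)) % 2
    e = 1 if random.random() < noise_rate else 0
    return a, (parity + e) % 2

secret = [random.randrange(2) for _ in range(16)]
samples = [lpn_sample(secret, noise_rate=0.1) for _ in range(5)]
# LPN asserts that recovering `secret` from many such samples is
# computationally hard; the paper shows this hardness persists even
# when the noise bits across samples are weakly dependent.
print(samples[0])
```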