Core concepts
The core message of this article is that by exploiting the low-rank structure in the state transition of POMDPs, it is possible to learn a minimal but sufficient representation of the observation and state histories, enabling sample-efficient reinforcement learning in POMDPs with infinite observation and state spaces.
Summary
The article proposes a reinforcement learning algorithm called Embed to Control (ETC) that learns the representation at two levels:
- At each step, ETC learns to represent the state with a low-dimensional feature that factorizes the transition kernel.
- Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding that aggregates the per-step features.
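To make the per-step factorization concrete, here is a minimal NumPy sketch (names and dimensions are illustrative, not from the paper) of a rank-d transition kernel of the form T(s' | s, a) = ⟨φ(s, a), μ(s')⟩, where d is much smaller than the number of states:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 50, 4, 3  # d is the rank / intrinsic dimension

# Hypothetical low-rank factors: phi embeds (state, action) pairs,
# mu embeds next states; both live in a d-dimensional space.
phi = rng.random((n_states, n_actions, d))  # per (s, a) feature
mu = rng.random((d, n_states))              # per next-state feature

T = phi @ mu                                # shape (n_states, n_actions, n_states)
T /= T.sum(axis=-1, keepdims=True)          # normalize each slice into a distribution

# Every slice T[s, a, :] is a distribution over next states, yet the kernel
# is described by O((|S||A| + |S|) * d) numbers rather than |S|^2 |A|,
# and its rank (as a matrix over next states) stays bounded by d.
assert np.allclose(T.sum(axis=-1), 1.0)
assert np.linalg.matrix_rank(T.reshape(-1, n_states)) <= d
```

Row-wise normalization is a diagonal rescaling, so it preserves the rank bound; this is the structural property that makes the representation learnable with few samples.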
The key insights are:
- The low-rank structure in the state transition allows for efficient representation learning and reinforcement learning.
- The future and past sufficiency assumptions ensure that the density of the state can be identified from the density of the future and past observations, respectively.
- ETC balances exploitation and exploration by constructing a confidence set of embeddings and conducting optimistic planning.
- ETC achieves an O(1/ε^2) sample complexity that scales polynomially with the horizon and the intrinsic dimension (the rank of the transition), bypassing the exponential dependence on the sizes of the observation and state spaces.
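The optimism principle behind the confidence-set construction can be illustrated with a generic UCB-style sketch (this is a hypothetical one-step example for intuition, not ETC's actual planning routine): each action's value estimate is inflated by a bonus that shrinks with the amount of data, so under-explored actions look attractive until enough samples rule them out.

```python
import numpy as np

def optimistic_values(counts, total_rewards, bonus_scale=1.0):
    """UCB-style optimism: empirical mean plus a count-based bonus.

    The bonus plays the role of the confidence set's width: it upper-bounds
    how far the true value can plausibly sit above the empirical mean.
    """
    counts = np.maximum(counts, 1)            # avoid division by zero
    means = total_rewards / counts            # empirical mean reward per action
    bonus = bonus_scale / np.sqrt(counts)     # shrinks as data accumulates
    return means + bonus                      # upper confidence bound per action

counts = np.array([10, 1, 100])               # times each action was tried
total_rewards = np.array([5.0, 0.9, 50.0])    # cumulative reward per action
ucb = optimistic_values(counts, total_rewards)

# Action 1 has been tried only once, so its large bonus dominates: optimism
# steers exploration toward it even though its empirical mean is not highest.
assert int(np.argmax(ucb)) == 1
```

ETC applies the same exploit-explore balancing at the level of history embeddings, planning optimistically against the most favorable model in the confidence set.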
Stats
The article does not report numerical experiments or empirical metrics; it focuses on the theoretical analysis of the proposed algorithm and its sample-complexity guarantees.