Core Concepts
This work proposes the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which provably addresses the challenges of confounding bias and distributional shift in offline reinforcement learning for partially observable Markov decision processes.
Abstract
The paper studies offline reinforcement learning (RL) in partially observable Markov decision processes (POMDPs), where the dataset contains only partial observations of the underlying states. The key challenges are confounding bias, which arises because the latent states simultaneously affect the actions and the observations, and distributional shift between the behavior policy that generated the data and the target policies being optimized.
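As a notational sketch of this setting (the symbols $s_h$, $o_h$, $r_h$, $\nu_h$, and $\mathbb{O}_h$ below are standard POMDP notation chosen here for illustration, not necessarily the paper's exact notation), the confounded dataset can be pictured as
$$ \mathcal{D} = \big\{ (o_h^i, a_h^i, r_h^i) \big\}_{i \in [n],\, h \in [H]}, \qquad a_h^i \sim \nu_h(\cdot \mid s_h^i), \quad o_h^i \sim \mathbb{O}_h(\cdot \mid s_h^i), $$
where the latent state $s_h^i$ is never recorded. Because $s_h^i$ influences both the logged action and the logged observation (and reward), it acts as an unobserved confounder, which is what invalidates standard offline RL value estimation here.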
To address these challenges, the authors propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which leverages tools from proximal causal inference to identify the value of each policy. Specifically:
P3O identifies the policy value using confounding bridge functions that satisfy a sequence of backward conditional moment equations, analogous to the Bellman equations in classical RL (a schematic form of these equations is given after this list).
P3O estimates these bridge functions via minimax estimation, and constructs a sequence of novel confidence regions to handle the distributional shift.
P3O then applies the pessimism principle, selecting the policy that maximizes the most conservative value estimate consistent with the confidence regions (a minimal sketch of this estimation-and-pessimism loop follows the list).
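One schematic way to write the backward conditional moment equations for the confounding bridge functions (with $W_h$ and $Z_h$ denoting generic proxy variables, e.g. suitable current/future and past observations, and $b^{\pi}_{H+1} \equiv 0$; the precise choice of proxies and conditioning sets follows the paper and is not reproduced exactly here) is
$$ \mathbb{E}\Big[ b_h^{\pi}(W_h, a) \,\Big|\, Z_h, A_h = a \Big] \;=\; \mathbb{E}\Big[ R_h + \sum_{a'} \pi_{h+1}(a' \mid O_{h+1})\, b_{h+1}^{\pi}(W_{h+1}, a') \,\Big|\, Z_h, A_h = a \Big], $$
solved backward in $h$ just like a Bellman backup; the value of $\pi$ is then read off from the first-step bridge function $b_1^{\pi}$.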
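The following is a minimal, self-contained sketch (not the paper's implementation) of the estimation-and-pessimism loop for a single horizon step and a finite set of candidate policies, assuming linear function classes for the bridge and test functions. All names (`fit_bridge_gmm`, `phi`, `psi`, the synthetic targets, the confidence radius) are illustrative placeholders, and the minimax step is replaced by a GMM-style surrogate for brevity.

```python
# Illustrative sketch only: linear bridge/test function classes, a GMM-style
# surrogate for the minimax step, a crude confidence region, and pessimistic
# policy selection over a finite candidate set.  Names and thresholds are
# placeholders, not the paper's.
import numpy as np


def fit_bridge_gmm(phi, psi, y, reg=1e-6):
    """Surrogate for the minimax estimation step at one horizon step.

    Finds theta minimizing || E_n[ psi * (phi @ theta - y) ] ||^2, i.e. it
    drives the empirical conditional-moment violation small against linear
    test functions psi(z, a)."""
    n = len(y)
    A = psi.T @ phi / n            # E_n[psi phi^T]
    b = psi.T @ y / n              # E_n[psi y]
    return np.linalg.solve(A.T @ A + reg * np.eye(phi.shape[1]), A.T @ b)


def moment_violation(theta, phi, psi, y):
    """Norm of the empirical moment vector; small values mean the bridge
    candidate theta is consistent with the data."""
    return float(np.linalg.norm(psi.T @ (phi @ theta - y) / len(y)))


def pessimistic_value(thetas, phi, psi, y, value_of, radius):
    """Most conservative value over bridge candidates inside the confidence
    region {theta : moment_violation(theta) <= radius}."""
    feasible = [t for t in thetas if moment_violation(t, phi, psi, y) <= radius]
    return min((value_of(t) for t in feasible), default=-np.inf)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, n_policies = 500, 5, 3
    phi = rng.normal(size=(n, d))  # features of (W_h, A_h)
    psi = rng.normal(size=(n, d))  # features of the proxy (Z_h, A_h)
    # One synthetic regression target per candidate policy (stands in for
    # r_h plus the next-step bridge term evaluated under that policy).
    targets = [phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
               for _ in range(n_policies)]

    best_policy, best_value = None, -np.inf
    for k, y in enumerate(targets):
        theta_hat = fit_bridge_gmm(phi, psi, y)
        # Crude confidence region: all candidates whose moment violation is
        # within an O(1/sqrt(n)) margin of the estimate's own violation.
        radius = moment_violation(theta_hat, phi, psi, y) + 1.0 / np.sqrt(n)
        candidates = [theta_hat + 0.05 * rng.normal(size=d) for _ in range(50)]
        candidates.append(theta_hat)          # the estimate itself is feasible
        value_of = lambda t: float(np.mean(phi @ t))  # stand-in for J(pi_k)
        value_k = pessimistic_value(candidates, phi, psi, y, value_of, radius)
        if value_k > best_value:              # pessimism: keep the policy with
            best_policy, best_value = k, value_k  # the best worst-case value
    print(f"pessimistic choice: policy {best_policy}, value {best_value:.3f}")
```

The actual algorithm solves the minimax problem over general function classes and builds its confidence regions from the minimax objective itself; the GMM surrogate and the randomly perturbed candidates above are only stand-ins to make the pessimism step concrete.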
Under a partial coverage assumption on the confounded dataset, the authors prove that P3O achieves $\tilde{O}(n^{-1/2})$ suboptimality, where $n$ is the number of trajectories in the dataset. This makes P3O the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
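Written schematically, with $\hat{\pi}$ the policy returned by P3O and $\pi^*$ an optimal comparator policy, the guarantee bounds the suboptimality
$$ \mathrm{SubOpt}(\hat{\pi}) \;:=\; J(\pi^*) - J(\hat{\pi}) \;\le\; \tilde{O}\big(n^{-1/2}\big), $$
where the hidden factors involve the partial-coverage (concentrability-type) coefficient and the complexity of the bridge-function classes; the exact constants are as stated in the paper.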