Core Concepts
The paper proposes an Optimistic variant of Proximal Policy Optimization (OPPO) that incorporates exploration in a principled manner, and proves that it achieves √(d²H³T) regret (up to logarithmic factors) in episodic Markov decision processes with linear function approximation and adversarial rewards, where d is the feature dimension, H the episode horizon, and T the total number of steps.
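For reference, this is the standard episodic regret against the best fixed policy in hindsight; the definition below is the usual convention and is not spelled out in this summary:

```latex
% Standard episodic regret over K episodes against the best policy in
% hindsight; V^{pi,k}_1 denotes the value of policy pi under the
% adversarially chosen rewards of episode k (assumed notation).
\mathrm{Regret}(T)
  = \max_{\pi} \sum_{k=1}^{K}
    \Bigl( V^{\pi,\,k}_{1}\!\bigl(x^{k}_{1}\bigr)
         - V^{\pi^{k},\,k}_{1}\!\bigl(x^{k}_{1}\bigr) \Bigr)
  \;\le\; \widetilde{O}\!\bigl(\sqrt{d^{2}H^{3}T}\bigr),
  \qquad T = KH .
```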
Abstract
The paper addresses the challenge of designing a provably efficient policy optimization algorithm that incorporates exploration, a problem that remains largely open compared to the well-developed theory of value-based reinforcement learning.
Key highlights:
The paper proposes the Optimistic Proximal Policy Optimization (OPPO) algorithm, which follows an "optimistic version" of the policy gradient direction.
OPPO solves a KL-regularized policy optimization subproblem at each update, where the linear component of the objective is defined using an estimated action-value function augmented with an exploration bonus (a minimal sketch of this update appears after this list).
The exploration bonus quantifies the uncertainty in estimating the action-value function from finite historical data, ensuring conservative optimism in the face of both uncertainty and adversarially chosen rewards.
The paper proves that OPPO achieves √(d²H³T) regret (up to logarithmic factors) in episodic Markov decision processes with linear function approximation and adversarial rewards, without requiring access to a simulator or finite concentrability coefficients.
This is the first provably sample-efficient policy optimization algorithm that incorporates exploration, complementing the existing computational efficiency guarantees of policy optimization algorithms.
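Below is a minimal sketch of a single OPPO-style update at one step, assuming a d-dimensional feature map and a small finite action set. The function names (`optimistic_q_estimate`, `kl_regularized_update`) and the constants `alpha`, `beta`, and `lam` are illustrative assumptions, not the paper's exact specification.

```python
# A minimal sketch of one OPPO-style update, under the stated assumptions.
import numpy as np

def optimistic_q_estimate(Phi, targets, phi_query, lam=1.0, beta=1.0, H=10):
    """Least-squares action-value estimate plus a UCB-style exploration bonus.

    Phi:       (n, d) features of previously visited (state, action) pairs
    targets:   (n,) regression targets (reward + next-step value estimates)
    phi_query: (A, d) features of the candidate actions at the current state
    """
    d = Phi.shape[1]
    Lambda = Phi.T @ Phi + lam * np.eye(d)          # regularized covariance
    w = np.linalg.solve(Lambda, Phi.T @ targets)    # ridge-regression weights
    Lambda_inv = np.linalg.inv(Lambda)
    # Bonus quantifies the estimation uncertainty at each candidate action.
    bonus = beta * np.sqrt(
        np.einsum("ad,de,ae->a", phi_query, Lambda_inv, phi_query)
    )
    q_hat = phi_query @ w + bonus
    return np.clip(q_hat, 0.0, H)                   # keep values in [0, H]

def kl_regularized_update(pi_old, q_hat, alpha=0.1):
    """Closed-form solution of the KL-regularized subproblem
    max_pi <pi, q_hat> - (1/alpha) * KL(pi || pi_old) over the simplex,
    i.e. pi_new proportional to pi_old * exp(alpha * q_hat)."""
    logits = np.log(pi_old + 1e-12) + alpha * q_hat
    logits -= logits.max()                          # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

# Toy usage with random historical data.
rng = np.random.default_rng(0)
n, d, A = 50, 4, 3
Phi = rng.normal(size=(n, d))
targets = rng.uniform(0.0, 1.0, size=n)
phi_query = rng.normal(size=(A, d))

q_hat = optimistic_q_estimate(Phi, targets, phi_query)
pi_old = np.full(A, 1.0 / A)
pi_new = kl_regularized_update(pi_old, q_hat)
print(pi_new)
```

The exponentiated update in `kl_regularized_update` is the standard closed-form solution of a KL-regularized linear objective over the probability simplex; the bonus term in `optimistic_q_estimate` is the usual elliptical-confidence-set quantity φᵀΛ⁻¹φ used for optimism under linear function approximation.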
Stats
The paper is purely theoretical: it reports no numerical metrics or figures, and its main quantitative result is the √(d²H³T) regret bound established in the analysis.