Provably Efficient Exploration in Policy Optimization for Markov Decision Processes with Linear Function Approximation and Adversarial Rewards
The paper proposes Optimistic Proximal Policy Optimization (OPPO), a variant of PPO that incorporates exploration in a principled manner, and proves that it achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret in episodic Markov decision processes with linear function approximation and adversarial rewards, where $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps.
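To make the summary concrete, the following is a minimal, illustrative sketch of the two ingredients such an optimistic policy-optimization scheme combines at each step: a linear Q-value estimate augmented with a UCB-style exploration bonus, and a KL-regularized (mirror-descent) policy update. All names and numbers are hypothetical placeholders, not the paper's pseudocode.

```python
import numpy as np

def optimistic_q(phi_sa, w, Lambda, beta):
    """Linear Q estimate plus a UCB-style bonus beta * sqrt(phi^T Lambda^{-1} phi).

    phi_sa: feature vector phi(s, a); w: least-squares weights;
    Lambda: regularized Gram matrix of observed features; beta: bonus scale.
    (All hypothetical names for illustration.)
    """
    bonus = beta * np.sqrt(phi_sa @ np.linalg.solve(Lambda, phi_sa))
    return phi_sa @ w + bonus

def mirror_descent_update(pi, q_values, alpha):
    """Proximal (KL-regularized) policy update: pi_new proportional to pi * exp(alpha * Q)."""
    logits = np.log(pi) + alpha * q_values
    logits -= logits.max()           # subtract max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()

# Toy usage: one state, 3 actions, d = 2 features.
d, n_actions = 2, 3
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # phi(s, a) for each action
w = np.array([0.2, 0.8])                               # fitted linear weights
Lambda = np.eye(d) + phi.T @ phi                       # identity-regularized Gram matrix
q = np.array([optimistic_q(phi[a], w, Lambda, beta=0.1) for a in range(n_actions)])
pi = mirror_descent_update(np.full(n_actions, 1 / 3), q, alpha=1.0)
print(pi)  # the updated policy puts more mass on actions with higher optimistic Q
```

The exploration bonus shrinks along frequently visited feature directions (as `Lambda` grows), while the multiplicative policy update keeps each new policy close to the previous one, which is the mechanism behind the regret guarantee summarized above.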