Core Concepts
The authors propose efficient reinforcement learning algorithms, LSVI-UCB-Restart and Ada-LSVI-UCB-Restart, that adapt to nonstationary environments under linear function approximation. They derive minimax dynamic regret lower bounds and prove upper bounds for the proposed algorithms that nearly match them.
Abstract
The paper considers reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation in a drifting environment: both the reward and state transition functions may evolve over time, as long as their total variations are bounded.
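For concreteness, in the linear MDP model the transition and reward at each step h are linear in a known d-dimensional feature map, and the drifting setting lets the unknown parameters change across episodes k subject to a variation budget. The display below is a hedged reconstruction of that setup; the notation (phi, mu, theta, B_r, B_p) follows the standard linear MDP convention, and the exact norms and normalizations of the budgets are details of the paper.

```latex
% Linear MDP with drift: the feature map \phi is fixed, while the unknown
% parameters may change from episode k to k+1.
\mathbb{P}_h^{k}(\cdot \mid s, a) = \big\langle \phi(s, a), \mu_h^{k}(\cdot) \big\rangle,
\qquad
r_h^{k}(s, a) = \big\langle \phi(s, a), \theta_h^{k} \big\rangle .

% Variation budgets: the total drift of rewards and transitions is bounded.
B_r \;\ge\; \sum_{k} \sum_{h=1}^{H} \big\| \theta_h^{k+1} - \theta_h^{k} \big\|,
\qquad
B_p \;\ge\; \sum_{k} \sum_{h=1}^{H} \big\| \mu_h^{k+1} - \mu_h^{k} \big\| .
```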
The key contributions are:
The authors derive minimax dynamic regret lower bounds for nonstationary linear MDPs. In particular, no algorithm can achieve sublinear dynamic regret once the total variation grows linearly in the number of rounds T: an environment allowed a constant amount of change per round can keep invalidating whatever the learner has inferred from past data. They also establish a minimax regret lower bound for stationary linear MDPs.
They propose LSVI-UCB-Restart, an optimistic variant of least-squares value iteration that periodically restarts, discarding stale data so that its estimates track the drifting environment. They analyze its dynamic regret both when the local variations are known (allowing a tuned restart period) and when they are unknown; a code sketch of the restart mechanism appears after this list.
They introduce a parameter-free algorithm, Ada-LSVI-UCB-Restart, which extends LSVI-UCB-Restart to handle unknown variation budgets by learning a good restart period online. They prove that it achieves near-optimal dynamic regret without knowing the total variations; a schematic of the adaptive meta-loop also appears after this list.
Numerical experiments on synthetic nonstationary linear MDPs demonstrate the effectiveness of the proposed algorithms.
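To make the restart mechanism concrete, here is a minimal Python sketch of the LSVI-UCB-Restart loop described above: per-step ridge regression for the Q-weights, an optimistic bonus proportional to the feature norm under the inverse Gram matrix, and a full reset of all statistics every `restart_every` episodes. The interface (`phi`, `actions`) and the parameter choices are illustrative assumptions, and the sketch cuts one corner flagged in the comments.

```python
import numpy as np

class LSVIUCBRestart:
    """Sketch of optimistic least-squares value iteration with periodic restarts.

    Hypothetical interface: `phi(s, a)` returns a d-dimensional feature vector
    and `actions` is a finite action set; neither is specified by the summary.
    """

    def __init__(self, phi, actions, d, H, beta, lam=1.0, restart_every=100):
        self.phi, self.actions, self.d, self.H = phi, actions, d, H
        self.beta, self.lam = beta, lam            # UCB bonus scale, ridge weight
        self.restart_every = restart_every         # W: episodes per epoch
        self.episode = 0
        self._restart()

    def _restart(self):
        """Forget all past data: one regression problem per step h."""
        self.Lambda = [self.lam * np.eye(self.d) for _ in range(self.H)]
        self.data = [[] for _ in range(self.H)]    # (feature, regression target)
        self.w = [np.zeros(self.d) for _ in range(self.H)]

    def _q(self, h, s, a):
        """Optimistic Q-value: linear estimate plus beta * ||phi||_{Lambda^-1}."""
        f = self.phi(s, a)
        bonus = self.beta * np.sqrt(f @ np.linalg.solve(self.Lambda[h], f))
        return min(float(self.w[h] @ f) + bonus, float(self.H))  # clip at H

    def act(self, h, s):
        """Greedy action with respect to the optimistic Q at step h."""
        return max(self.actions, key=lambda a: self._q(h, s, a))

    def update(self, trajectory):
        """Incorporate one episode, a list of (s, a, r, s_next) of length H."""
        self.episode += 1
        if self.episode % self.restart_every == 0:  # epoch boundary: restart
            self._restart()
        for h in range(self.H - 1, -1, -1):         # backward value iteration
            s, a, r, s_next = trajectory[h]
            f = self.phi(s, a)
            self.Lambda[h] += np.outer(f, f)
            # Target = reward + optimistic value of the next step (0 at h = H-1).
            v_next = 0.0 if h == self.H - 1 else max(
                self._q(h + 1, s_next, b) for b in self.actions)
            self.data[h].append((f, r + v_next))
            # Ridge solution w_h = Lambda_h^{-1} * sum_tau y_tau * phi_tau.
            # (The full algorithm recomputes the targets with the latest
            # Q_{h+1} every episode; this sketch reuses stale targets.)
            rhs = np.sum([y * f_ for f_, y in self.data[h]], axis=0)
            self.w[h] = np.linalg.solve(self.Lambda[h], rhs)
```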
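Ada-LSVI-UCB-Restart removes the need to know the variation budget by treating the restart period itself as an arm of an adversarial bandit problem (a bandit-over-bandit construction). The schematic below conveys the idea under that assumption: an EXP3 meta-learner chooses among a grid of candidate periods, and `run_block` is a hypothetical callback that runs LSVI-UCB-Restart with the chosen period for one block of episodes and returns the reward collected.

```python
import numpy as np

def ada_restart_meta_loop(run_block, windows, num_blocks, reward_max, seed=0):
    """EXP3 meta-learner over candidate restart periods (bandit-over-bandit).

    run_block(W): hypothetical callback; runs LSVI-UCB-Restart with restart
        period W for one block of episodes and returns the total reward.
    windows: candidate restart periods, e.g. a geometric grid {1, 2, 4, ...}.
    reward_max: upper bound on per-block reward, used to normalize to [0, 1].
    """
    rng = np.random.default_rng(seed)
    K = len(windows)
    gamma = min(1.0, np.sqrt(K * np.log(K) / num_blocks))  # exploration rate
    log_w = np.zeros(K)                                    # log arm weights
    for _ in range(num_blocks):
        p = np.exp(log_w - log_w.max())                    # stable softmax
        p = (1.0 - gamma) * p / p.sum() + gamma / K        # mix in exploration
        i = rng.choice(K, p=p)                             # sample a period
        x = run_block(windows[i]) / reward_max             # normalized reward
        log_w[i] += gamma * x / (K * p[i])                 # importance-weighted
    return windows[int(np.argmax(log_w))]                  # best period found
```

Because the meta-learner observes only block rewards, it requires no knowledge of the total variation, which is what makes the algorithm parameter-free.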
Stats
No specific numerical results are extracted here. The paper's support for its key claims is primarily theoretical, in the form of dynamic regret upper and lower bounds; the synthetic experiments serve as a qualitative check rather than a source of headline numbers.