Core Concepts
The paper establishes global optimality and convergence guarantees for a two-timescale actor-critic algorithm with representation learning, where the critic is represented by an overparameterized neural network and is updated via temporal-difference learning, while the actor is updated via proximal policy optimization. The analysis is conducted in the mean-field limit regime, where the neural network width goes to infinity and the updates are studied in continuous time.
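As an informal illustration of this setup, the sketch below runs a two-timescale actor-critic loop on a randomly generated toy MDP: a wide two-layer critic is updated by TD(0) with a larger step size, while a tabular softmax actor takes a slower PPO-style (KL-proximal) step. Every concrete choice here (MDP sizes, one-hot features, step sizes, the fixed output layer) is an illustrative assumption, not the paper's exact algorithm.

```python
import numpy as np

# Hypothetical toy MDP; sizes, rewards, and step sizes are illustrative assumptions.
rng = np.random.default_rng(0)
n_states, n_actions, width = 5, 3, 256
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> distribution over s'
R = rng.normal(size=(n_states, n_actions))                        # reward table
gamma = 0.9

# Overparameterized two-layer critic: Q(s, a) = (1/width) * sum_i a_i * relu(w_i . phi(s, a)),
# with a fixed random output layer, as is common in mean-field analyses.
phi = np.eye(n_states * n_actions)                   # one-hot (s, a) features
W = rng.normal(size=(width, n_states * n_actions))   # trainable first-layer weights
a_out = rng.choice([-1.0, 1.0], size=width)          # fixed output weights

def q_value(s, a, W):
    pre = W @ phi[s * n_actions + a]
    return a_out @ np.maximum(pre, 0.0) / width

# Tabular softmax actor
theta = np.zeros((n_states, n_actions))

def policy(s, theta):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

eta_critic, eta_actor = 1e-1, 1e-2   # two timescales: the critic step size is larger (faster)
s = 0
for t in range(2000):
    a = rng.choice(n_actions, p=policy(s, theta))
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=policy(s_next, theta))

    # Critic: one TD(0) semigradient step on the first-layer weights.
    delta = R[s, a] + gamma * q_value(s_next, a_next, W) - q_value(s, a, W)
    x = phi[s * n_actions + a]
    grad_W = np.outer(a_out * (W @ x > 0), x) / width   # dQ/dW at (s, a)
    W += eta_critic * delta * grad_W

    # Actor: slower PPO-style step; for a tabular softmax policy the KL-proximal
    # update reduces to adding the scaled critic estimates to the logits.
    q_row = np.array([q_value(s, b, W) for b in range(n_actions)])
    theta[s] += eta_actor * q_row

    s = s_next
```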
Abstract
The paper analyzes a two-timescale actor-critic (AC) algorithm for reinforcement learning, where the critic is represented by an overparameterized neural network and is updated via temporal-difference (TD) learning, while the actor is updated via proximal policy optimization (PPO).
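One standard way to read the PPO-replicator connection highlighted below is to write the actor step in its KL-penalized proximal form. The derivation sketch below uses generic symbols (step size η, current critic estimate Q_k) rather than the paper's exact notation and is offered only as a heuristic reading of the abstract.

```latex
\begin{align*}
% KL-penalized proximal form of the PPO actor step
\pi_{k+1}(\cdot \mid s)
  &= \operatorname*{arg\,max}_{\pi} \;
     \mathbb{E}_{a \sim \pi}\big[ Q_k(s,a) \big]
     - \tfrac{1}{\eta}\,\mathrm{KL}\big( \pi \,\big\|\, \pi_k(\cdot \mid s) \big), \\
% closed-form solution: a multiplicative-weights (mirror-descent) update
\pi_{k+1}(a \mid s)
  &\propto \pi_k(a \mid s)\,\exp\!\big( \eta\, Q_k(s,a) \big), \\
% continuous-time limit of small steps: replicator dynamics
\frac{\mathrm{d}}{\mathrm{d}t}\,\pi_t(a \mid s)
  &= \pi_t(a \mid s)\Big( Q_t(s,a)
     - \mathbb{E}_{a' \sim \pi_t(\cdot \mid s)}\big[ Q_t(s,a') \big] \Big).
\end{align*}
```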
Key highlights:
In the continuous-time and infinite-width limiting regime, the critic update is captured by a Wasserstein semigradient flow, while the actor update is connected to replicator dynamics.
The separation of timescales between the actor and critic updates, with the critic operating on the faster timescale, plays a crucial role in the convergence analysis.
The paper establishes global optimality and sublinear convergence guarantees for the two-timescale AC algorithm.
The feature representation induced by the critic is allowed to evolve within a neighborhood of the initial representation, in contrast to the neural tangent kernel regime where the features are fixed.
A restarting mechanism is introduced to ensure the critic's feature representation stays within this neighborhood of the initial one, which is essential for the theoretical guarantees; a toy particle-level sketch of this restart appears after the list.
The analysis combines tools from variational inequalities, mean-field theory of neural networks, and two-timescale stochastic approximation.
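To make the mean-field and restarting ideas above concrete, the toy sketch below represents the critic as an empirical distribution over neuron "particles", applies TD semigradient steps to the particles (a finite-particle stand-in for the Wasserstein semigradient flow), monitors a crude proxy for the Wasserstein-2 distance from the initial particle configuration, restarts when the drift exceeds a radius, and couples this with an Euler discretization of replicator dynamics for a single-state actor. All dimensions, step sizes, the restart radius, and the fake TD target are made-up illustrative values, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n_particles, dim, n_actions = 512, 8, 4
theta0 = rng.normal(size=(n_particles, dim))   # initial particle (neuron-parameter) configuration
theta = theta0.copy()

def critic(x, theta):
    """Mean-field critic: average of tanh features over the particles."""
    return np.tanh(theta @ x).mean()

def particle_td_step(theta, x, target, lr):
    """One TD semigradient step on the particles; in the infinite-particle,
    continuous-time limit this kind of update is what a Wasserstein semigradient
    flow describes."""
    err = critic(x, theta) - target
    grad = err * (1.0 - np.tanh(theta @ x) ** 2)[:, None] * x[None, :] / len(theta)
    return theta - lr * grad

def drift_from_init(theta, theta0):
    """Crude proxy (identity-coupling upper bound) for the Wasserstein-2 distance
    between the current and initial particle distributions."""
    return np.sqrt(((theta - theta0) ** 2).sum(axis=1).mean())

# Single-state actor over a few actions with hypothetical feature vectors.
pi = np.full(n_actions, 1.0 / n_actions)
action_feats = rng.normal(size=(n_actions, dim))

restart_radius, lr_critic, lr_actor = 0.5, 0.5, 0.05
for t in range(500):
    a = rng.choice(n_actions, p=pi)
    x = action_feats[a]
    fake_td_target = np.sin(x.sum())   # stand-in for r + gamma * Q(s', a'); purely illustrative
    theta = particle_td_step(theta, x, fake_td_target, lr_critic)

    # Toy analogue of the restarting mechanism: reset the particles whenever the
    # representation drifts outside a neighborhood of its initialization.
    if drift_from_init(theta, theta0) > restart_radius:
        theta = theta0.copy()

    # Euler step of replicator dynamics: pi(a) grows when Q(a) beats the pi-average.
    q = np.array([critic(action_feats[b], theta) for b in range(n_actions)])
    pi = pi + lr_actor * pi * (q - pi @ q)
    pi = pi / pi.sum()   # guard against floating-point drift
```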
Stats
The paper is purely theoretical and does not report any explicit numerical metrics or figures.
Quotes
The paper does not contain any striking quotes.