Core Concepts
Overparameterized two-layer neural networks enable temporal-difference (TD) and Q-learning to globally minimize the mean-squared projected Bellman error and learn an optimal feature representation.
Abstract
The paper studies how overparameterized two-layer neural networks enable temporal-difference (TD) and Q-learning to learn an optimal feature representation and to globally minimize the mean-squared projected Bellman error.
Key highlights:
Deep reinforcement learning uses expressive neural networks to parameterize policies and value functions, inducing a data-dependent feature representation (a concrete two-layer parameterization is sketched after this list).
A fundamental challenge is that the evolving feature representation can cause TD and Q-learning to diverge.
Previous analyses in the neural tangent kernel (NTK) regime showed that TD can converge to the globally optimal solution, but only with the feature representation confined to an infinitesimal neighborhood of its initialization.
This work goes beyond the NTK regime and shows that overparameterized two-layer neural networks enable TD and Q-learning to globally minimize the mean-squared projected Bellman error and learn an optimal feature representation.
The key is a mean-field perspective that connects the evolution of the finite-dimensional parameter to its limiting counterpart over an infinite-dimensional Wasserstein space.
The analysis is extended to soft Q-learning, which is equivalent to policy gradient (see the note after this list).
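To make the objects in these highlights concrete, the equations below use a standard two-layer parameterization and the usual definition of the mean-squared projected Bellman error; the notation (activation $\sigma$, sampling distribution $\mu$, the $1/m$ scaling) is illustrative and may differ from the paper's exact conventions. A width-$m$ two-layer network represents the action value as

$$\hat{Q}(s, a; \theta) = \frac{1}{m} \sum_{i=1}^{m} b_i \, \sigma\big(w_i^\top (s, a)\big), \qquad \theta = \{(b_i, w_i)\}_{i=1}^{m},$$

so the features $\sigma(w_i^\top (s, a))$ are themselves learned from data. In the NTK regime the network is effectively linearized around its initialization, which keeps these features essentially fixed. In the mean-field regime, as $m \to \infty$ the empirical distribution of the parameters $(b_i, w_i)$ converges to a measure $\rho$, giving

$$\hat{Q}(s, a; \rho) = \int b \, \sigma\big(w^\top (s, a)\big) \, \mathrm{d}\rho(b, w),$$

and the TD dynamics on the finite-dimensional parameter $\theta$ correspond to a gradient flow of $\rho$ over the infinite-dimensional Wasserstein space of probability measures, which is what permits the representation to move far from its initialization. The objective being globally minimized is the mean-squared projected Bellman error

$$\mathrm{MSPBE}(\theta) = \mathbb{E}_{(s, a) \sim \mu}\Big[\big(\hat{Q}(s, a; \theta) - \Pi_{\mathcal{F}}\, \mathcal{T}^{\pi} \hat{Q}(s, a; \theta)\big)^2\Big],$$

where $\mathcal{T}^{\pi}$ is the Bellman operator of the evaluated policy and $\Pi_{\mathcal{F}}$ denotes projection onto the function class realized by the network.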
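As a purely illustrative companion, the sketch below runs semi-gradient TD(0) with a wide two-layer ReLU network under the $1/m$ (mean-field) scaling on a small synthetic MDP. The environment, one-hot features, width, step size, and the choice to train only the first layer (output weights frozen at $\pm 1$) are assumptions for the demo, not the paper's setup.

```python
# Semi-gradient TD(0) with a wide two-layer ReLU network in the 1/m (mean-field) scaling.
# Everything below (MDP, features, width, step size) is an illustrative placeholder.
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic MDP evaluated under a fixed uniform policy.
n_states, n_actions, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.normal(size=(n_states, n_actions))                        # reward table
policy = np.full((n_states, n_actions), 1.0 / n_actions)

def phi(s, a):
    """One-hot state-action input x = (s, a) fed to the network."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

d = n_states * n_actions

# Overparameterized two-layer network: Q(x) = (1/m) * sum_i b_i * relu(w_i . x).
m = 1024                              # width; large m approximates the infinite-width limit
W = rng.normal(size=(m, d))           # first-layer weights w_i (trained)
b = rng.choice([-1.0, 1.0], size=m)   # output weights b_i (frozen, a common simplification)

def q_value(x):
    return (b * np.maximum(W @ x, 0.0)).mean()

def grad_W(x):
    """Gradient of q_value with respect to each row w_i."""
    act = (W @ x > 0.0).astype(float)                      # relu'(w_i . x)
    return (b[:, None] * act[:, None] * x[None, :]) / m

lr = 0.1 * m   # step size scaled with the width so the value estimate moves at O(1) speed
s = 0
for _ in range(20000):
    a = rng.choice(n_actions, p=policy[s])
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=policy[s_next])
    x, x_next = phi(s, a), phi(s_next, a_next)
    # Semi-gradient TD(0): only the current estimate q_value(x) is differentiated.
    td_error = R[s, a] + gamma * q_value(x_next) - q_value(x)
    W += lr * td_error * grad_W(x)
    s = s_next

print("example estimate Q(s=0, a=0):", q_value(phi(0, 0)))
```

The step size is scaled with $m$ so that, despite the $1/m$ output scaling, individual neurons move by an $O(1)$ amount over training and the learned features genuinely change, in contrast to the NTK regime where they stay pinned near initialization.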
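On the last highlight, the soft Q-learning connection can be written in standard entropy-regularized notation (temperature $\tau$; again illustrative rather than the paper's exact formulation):

$$Q_{\mathrm{soft}}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s'}\Big[\tau \log \sum_{a'} \exp\big(Q_{\mathrm{soft}}(s', a') / \tau\big)\Big], \qquad \pi(a \mid s) \propto \exp\big(Q_{\mathrm{soft}}(s, a) / \tau\big).$$

Soft Q-learning updates on $Q_{\mathrm{soft}}$ and policy-gradient updates on the induced energy-based policy optimize the same entropy-regularized objective, which is the equivalence the extension relies on.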