Core Concepts
The authors study the convergence of the gradient descent-ascent (GDA) algorithm and the representation learning of neural networks for solving minimax optimization problems defined over infinite-dimensional function classes, with a focus on functional conditional moment equations.
Abstract
The authors study the convergence of the gradient descent-ascent (GDA) algorithm and the representation learning of neural networks for solving minimax optimization problems defined over infinite-dimensional function classes. As an initial step, they consider the minimax optimization problem that arises when a functional equation defined by conditional expectations is estimated adversarially, in which case the objective function is quadratic in the function space.
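To make this concrete, here is a generic sketch of how adversarial estimation of a conditional moment equation yields a quadratic minimax objective; the residual m, test function u, and function classes F, U below are standard adversarial-estimation notation assumed for illustration, not reproduced from the paper:

```latex
% Conditional moment restriction: find f with E[ m(Z; f) | X ] = 0.
% Adversarial (dual) reformulation over a test function u:
\min_{f \in \mathcal{F}} \; \max_{u \in \mathcal{U}} \;
  L(f, u) = \mathbb{E}\big[ m(Z; f)\, u(X) \big]
          - \tfrac{1}{2}\, \mathbb{E}\big[ u(X)^2 \big].
% The inner maximum is attained at u^*(X) = E[ m(Z; f) | X ], so
%   max_u L(f, u) = (1/2) * E[ (E[ m(Z; f) | X ])^2 ],
% i.e., the objective is quadratic in the function space.
```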
The key insights are:
In the mean-field regime, the GDA algorithm corresponds to a Wasserstein gradient flow over the space of probability measures defined on the neural network parameters (see the sketches after this list).
They prove that this Wasserstein gradient flow converges globally to a stationary point of the minimax objective at a sublinear rate of O(1/T + 1/α), where T is the time horizon and α is the scaling parameter of the neural network.
They show that the feature representation induced by the neural networks may deviate from its initialization by a magnitude of O(1/α), measured in the Wasserstein distance. This behavior is not captured by the neural tangent kernel (NTK) analysis, in which the representation stays fixed at initialization.
When the regularizer on the function f satisfies a version of strong convexity, they prove that the Wasserstein gradient flow converges to the global optimizer f* at a sublinear O(1/T + 1/α) rate.
They apply their general results to concrete examples, including policy evaluation, nonparametric instrumental variable regression, asset pricing, and adversarial estimation of the Riesz representer.
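For readers unfamiliar with the mean-field correspondence cited in the list above, the following is a hedged sketch in standard mean-field notation; the activation σ, the measures μ_t, ν_t, and the first-variation notation are generic assumptions for illustration, not taken from the paper:

```latex
% Mean-field parameterization with scaling alpha: an N-neuron network
%   f(x) = (alpha / N) * sum_i sigma(x; theta_i)
% is, as N -> infinity, represented by a measure mu over parameters:
f_{\alpha}(x; \mu) = \alpha \int \sigma(x; \theta)\, \mathrm{d}\mu(\theta).
% GDA on the two players' parameters then corresponds to a Wasserstein
% gradient flow on the measures: mu_t (min player) descends, e.g.,
\partial_t \mu_t = \operatorname{div}\!\Big(
  \mu_t\, \nabla_{\theta}\,
  \frac{\delta L(\mu_t, \nu_t)}{\delta \mu}(\theta) \Big),
% while nu_t (max player) follows the same dynamics with flipped sign.
```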
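And here is a minimal, runnable Python (NumPy) sketch of the finite-particle view of that flow: each player is a wide two-layer network with the α/N scaling, and plain GDA on the particles discretizes the Wasserstein dynamics. The toy quadratic objective matches the adversarial form sketched under the abstract; the paper's regularizer on f is omitted, and all data, hyperparameters, and function names are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a conditional moment restriction E[y - f(x) | x] = 0,
# i.e., nonparametric regression with target f*(x) = sin(2x).
M = 512
x = rng.normal(size=M)
y = np.sin(2 * x) + 0.1 * rng.normal(size=M)

N, alpha = 256, 10.0  # particles per player, mean-field scaling

def init_particles():
    # One particle is theta_i = (a_i, w_i, b_i).
    return [rng.normal(size=N) for _ in range(3)]

theta_f, theta_u = init_particles(), init_particles()
theta_f0 = [p.copy() for p in theta_f]  # snapshot to measure drift later

def net(theta, x):
    # Two-layer net with the alpha/N mean-field scaling; returns the
    # output and the hidden features reused by the manual gradient.
    a, w, b = theta
    h = np.tanh(np.outer(x, w) + b)   # (M, N)
    return (alpha / N) * (h @ a), h   # (M,), (M, N)

def particle_grad(theta, h, g, x):
    # Gradient of sum_j g_j * net(theta)(x_j) w.r.t. (a, w, b).
    a, w, b = theta
    s = 1.0 - h ** 2                      # tanh'(w x + b)
    ga = (alpha / N) * (g @ h)
    gw = (alpha / N) * a * ((g * x) @ s)
    gb = (alpha / N) * a * (g @ s)
    return [ga, gw, gb]

eta_f, eta_u, T = 0.5, 0.5, 3000
for _ in range(T):
    f_x, h_f = net(theta_f, x)
    u_x, h_u = net(theta_u, x)
    r = f_x - y                           # residual m(z; f)
    # L(f, u) = mean(r * u) - 0.5 * mean(u^2):
    # dL/df(x_j) = u_j / M  and  dL/du(x_j) = (r_j - u_j) / M.
    g_f = particle_grad(theta_f, h_f, u_x / M, x)
    g_u = particle_grad(theta_u, h_u, (r - u_x) / M, x)
    theta_f = [p - eta_f * g for p, g in zip(theta_f, g_f)]  # descent
    theta_u = [p + eta_u * g for p, g in zip(theta_u, g_u)]  # ascent

f_x, _ = net(theta_f, x)
drift = np.mean([np.mean((p - p0) ** 2) for p, p0 in zip(theta_f, theta_f0)])
print(f"fit MSE: {np.mean((f_x - y) ** 2):.4f}  particle drift: {drift:.6f}")
```

Rerunning with a larger α (and step sizes scaled down accordingly) shrinks the printed particle drift, which loosely mirrors the O(1/α) bound on how far the learned representation moves from its initialization.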