Core Concepts
Misspecification uncertainties must be accounted for in underparametrized regression models to avoid severe underestimation of parameter uncertainties.
Abstract
The content addresses misspecification in deterministic, underparametrized regression models. In this setting the simulation engines that generate the data have vanishing aleatoric uncertainty and large quantities of training data are available, so model misspecification, rather than noise, is the dominant source of error.
Key highlights:
Minimizing the expected loss (negative log likelihood) ignores misspecification, so parameter uncertainties vanish in the underparametrized limit.
The generalization error, which measures the cross-entropy between predicted and observed data distributions, diverges as 1/ϵ^2 for the minimum loss solution under misspecification, where ϵ is the vanishing aleatoric noise scale.
To avoid this divergence, the parameter distribution must place probability mass in the pointwise optimal parameter set (POPS) of every training point, i.e. the set of parameters that minimizes the loss at that single point.
An ensemble ansatz is proposed that respects this POPS covering constraint and can be efficiently evaluated for linear models through rank-one updates.
The POPS-constrained ensemble provides robust bounds on test errors, outperforming standard Bayesian regression approaches, and is demonstrated on challenging high-dimensional datasets from atomistic machine learning.
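The POPS covering idea for linear models can be illustrated with a minimal sketch. It assumes a linear model y ≈ X θ with squared loss, and builds one ensemble member per training point by minimizing the global loss subject to fitting that point exactly. Under these assumptions the constrained solution is a rank-one correction to the least-squares solution θ*: θ_i = θ* + (r_i / h_i) A^{-1} x_i, with A = X^T X, residual r_i = y_i - x_i·θ*, and leverage h_i = x_i^T A^{-1} x_i. The function names `pops_ensemble` and `envelope` and the small `ridge` term are hypothetical choices for illustration, not taken from the source, and the closed form is a standard constrained least-squares result rather than a confirmed reproduction of the authors' implementation.

```python
import numpy as np

def pops_ensemble(X, y, ridge=1e-10):
    """Sketch: one ensemble member per training point for a linear model.

    Member i minimizes the global squared loss subject to x_i . theta = y_i.
    `ridge` is an assumed small regularizer that keeps A invertible.
    """
    n_points, n_params = X.shape
    A = X.T @ X + ridge * np.eye(n_params)
    theta_star = np.linalg.solve(A, X.T @ y)        # minimum loss solution
    A_inv_Xt = np.linalg.solve(A, X.T)              # column i is A^{-1} x_i
    leverage = np.einsum("ij,ji->i", X, A_inv_Xt)   # h_i = x_i^T A^{-1} x_i
    residual = y - X @ theta_star                   # r_i = y_i - x_i . theta*
    # Rank-one correction per point: theta_i = theta* + (r_i / h_i) A^{-1} x_i
    thetas = theta_star[None, :] + (residual / leverage)[:, None] * A_inv_Xt.T
    return theta_star, thetas

def envelope(X_eval, thetas):
    """Min and max ensemble prediction at each evaluation point."""
    preds = X_eval @ thetas.T                       # shape (n_eval, n_members)
    return preds.min(axis=1), preds.max(axis=1)

if __name__ == "__main__":
    # Misspecified toy problem: quadratic features fit to a noiseless sine.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, 40)
    X = np.stack([np.ones_like(x), x, x**2], axis=1)
    y = np.sin(3.0 * x)
    theta_star, thetas = pops_ensemble(X, y)
    lo, hi = envelope(X, thetas)
    print("largest min-loss residual:", np.abs(y - X @ theta_star).max())
    print("mean envelope width:", (hi - lo).mean())
```

Because member i reproduces training point i exactly, the min-max envelope contains every training target by construction. The sketch computes all members from a single solve against A, which is where the efficiency for linear models comes from.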
Stats
The content discusses the following key figures:
The generalization error of the minimum loss solution diverges as 1/ϵ^2 under misspecification.
The POPS-constrained ensemble approach provides almost perfect bounding of test errors, with envelope violation rates dropping from 40% to 10% as N/P, the ratio of training points to model parameters, increases from 1 to 50.
For the atomistic machine learning application, the mean ratio of the minimum loss model's residual to the ensemble envelope width drops from 0.45 at N/P=1 to 0.25 at N/P=50.
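The two envelope statistics above can be computed as in the following sketch. The source does not define them precisely, so the definitions here are assumptions: the violation rate is taken to be the fraction of test points whose true value falls outside the min-max range of the ensemble predictions, and the ratio is the absolute residual of the minimum loss model divided by the envelope width, averaged over test points. The function name `envelope_metrics` is hypothetical.

```python
import numpy as np

def envelope_metrics(y_true, y_min_loss, ensemble_preds):
    """Sketch of two envelope statistics under assumed definitions.

    y_true:         (M,) true test values
    y_min_loss:     (M,) predictions of the minimum loss model
    ensemble_preds: (M, K) predictions of K ensemble members
    """
    lo = ensemble_preds.min(axis=1)
    hi = ensemble_preds.max(axis=1)
    # Assumed definition: a violation is a true value outside [lo, hi].
    violation_rate = float(((y_true < lo) | (y_true > hi)).mean())
    # Assumed definition: |residual| / envelope width, skipping zero widths.
    width = hi - lo
    valid = width > 0
    ratio = np.abs(y_true - y_min_loss)[valid] / width[valid]
    return violation_rate, float(ratio.mean())

if __name__ == "__main__":
    y_true = np.array([0.0, 1.0, 5.0])
    y_min_loss = np.array([0.5, 1.0, 3.0])
    ensemble_preds = np.array([[-1.0, 1.0], [0.0, 2.0], [2.0, 4.0]])
    rate, mean_ratio = envelope_metrics(y_true, y_min_loss, ensemble_preds)
    print(rate, mean_ratio)
```

In the toy data, only the third point lies outside its envelope, so one of three points counts as a violation.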
Quotes
The content does not contain any direct quotes that are critical to the key arguments.