Core Concepts
Training large language models or generative AI models on their own synthesized outputs can lead to a phenomenon known as "model collapse", where the model's performance degrades over time until it becomes completely useless. This work provides a theoretical understanding of this phenomenon in the setting of high-dimensional supervised learning with kernel regression.
Abstract
The authors initiate a theoretical study of model collapse in the context of kernel regression. Their main findings can be summarized as follows:
- Exact Characterization of Test Error: The authors obtain analytic formulae for the test error of a downstream model trained on n-fold fake data generation. These formulae reveal a degradation of the test error that grows linearly with the number of generations.
- Modified Scaling Laws: In the case of power-law spectra, the authors obtain precise scaling laws that quantify the negative effect of training on fake data. This leads to a crossover from the fast error rate of the noiseless regime to a much slower rate that depends on the amount of true data used to train the fake-data generator.
- Optimal Regularization for Mitigating Collapse: The authors propose a corrected value of the regularization exponent that gracefully adapts to the presence of synthesized data, in contrast to the optimal value for the classical setting of training on real data only.
The authors validate their theoretical results through experiments on both simulated data and real-world data (MNIST). The results demonstrate the effectiveness of their proposed strategies in mitigating the effects of model collapse.
Stats
The test error of the downstream model trained on n-fold fake data is given by E_test(w^pred_n) ≃ E^clean_test + n × Δ, where E^clean_test is the usual test error when training on clean data, and Δ depends on problem parameters such as the feature covariance matrix, the sample size, the strength of the data generator, and the label noise levels.
In the power-law spectrum case, the test error scales as E_test(w^pred_n) ≍ max(σ², T^(1-2rℓ-ℓ/β)) × T^(-(1-ℓ/β)) + n × (σ_0² / (1-ϕ_0)) × max(T/T_0, ϕ_0) × T^(-(1-ℓ/β)), where ℓ is the regularization exponent, β is the power-law decay exponent of the spectrum, T is the downstream sample size, and T_0 is the amount of true data used to train the fake-data generator.
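To see the crossover numerically, the sketch below evaluates the two terms of this scaling law for hypothetical exponents and noise levels (r = 1, ℓ = 0.5, β = 2, and so on; none of these values come from the paper) and reports which term dominates as T grows:

```python
def scaling_law_terms(T, n, T0, r=1.0, ell=0.5, beta=2.0,
                      sigma2=1e-4, sigma0_2=1e-2, phi0=0.5):
    # Clean term: max(sigma^2, T^(1 - 2*r*ell - ell/beta)) * T^(-(1 - ell/beta))
    # Fake term:  n * sigma0^2 / (1 - phi0) * max(T/T0, phi0) * T^(-(1 - ell/beta))
    rate = T ** (-(1 - ell / beta))
    clean = max(sigma2, T ** (1 - 2 * r * ell - ell / beta)) * rate
    fake = n * sigma0_2 / (1 - phi0) * max(T / T0, phi0) * rate
    return clean, fake

for T in [10**3, 10**4, 10**5, 10**6]:
    clean, fake = scaling_law_terms(T, n=5, T0=10**4)
    winner = "fake-data term" if fake > clean else "clean term"
    print(f"T = {T:>7}: clean = {clean:.2e}, fake = {fake:.2e} -> {winner}")
```

With these particular values the clean term decays like T^-1 while the fake-data term eventually grows with T, so the fake-data term takes over around T ≈ T_0, illustrating the crossover from the fast rate to the slow one.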
The optimal regularization exponent is ℓ⋆ = min((b - a) × ℓ_crit, β), where b = log(T_0) / log(T), a is the exponent governing how the number of generations scales with the sample size (n ≍ T^a), and ℓ_crit is the optimal exponent in the classical setting without fake data.
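As a small worked example (the function name is hypothetical, and a and ℓ_crit are taken as given rather than derived), the corrected exponent can be computed directly:

```python
import math

def corrected_reg_exponent(T, T0, a, ell_crit, beta):
    # l_star = min((b - a) * l_crit, beta), with b = log(T0) / log(T)
    b = math.log(T0) / math.log(T)
    return min((b - a) * ell_crit, beta)

# Generator trained on T0 = 10^6 true samples, downstream sample size T = 10^4,
# number of generations scaling like n ~ T^0.1:
print(corrected_reg_exponent(T=10**4, T0=10**6, a=0.1, ell_crit=0.5, beta=2.0))
# b = 6/4 = 1.5, so l_star = min((1.5 - 0.1) * 0.5, 2.0) = 0.7
```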
Quotes
"Training large language models or generative AI models on their own synthesized outputs can lead to a phenomenon known as 'model collapse', where the model's performance degrades over time until it becomes completely useless."
"Our analysis reveals that AI-generated data alters the optimal regularization for downstream models. Drawing from the insight that regularization mirrors early stopping, our study suggests that models trained on mixed real and AI-generated data may initially improve but later decline in performance (model collapse), necessitating early detection of this inflection point."