Key concepts
Recursive training on synthetic data generated by previous language models inevitably leads to model collapse: the trained models lose diversity and eventually degenerate into Dirac (single-point) distributions. Incorporating a sufficient amount of real data in each generation can mitigate this issue.
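As a rough, hedged illustration of this dynamic (a minimal sketch of the simplest multinomial version of the process, not the paper's actual experiments; the vocabulary size and per-generation sample size below are arbitrary illustration values), each generation is "trained" by taking the empirical distribution of samples drawn from the previous generation's model, and the support shrinks until only one token remains:

```python
# Minimal sketch (assumed setup, not the paper's code): fully synthetic recursion on a
# categorical distribution. Each generation's model is the empirical distribution of
# n_samples tokens drawn from the previous generation's model.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_samples = 50, 200

p = np.full(vocab_size, 1.0 / vocab_size)  # generation 0: uniform "real" distribution
for generation in range(1, 20_001):
    p = rng.multinomial(n_samples, p) / n_samples  # refit on purely synthetic samples
    if generation % 200 == 0:
        print(f"generation {generation:5d}: support size = {np.count_nonzero(p)}")
    if np.count_nonzero(p) == 1:
        print(f"total collapse to a Dirac distribution after {generation} generations")
        break
```

In this simplified chain the Dirac distributions are the only absorbing states, so the loop terminates with probability one.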
Summary
The paper presents a statistical analysis of the phenomenon of "model collapse" in language models, where recursive training on synthetic data generated from previous models leads to a deterioration in performance and a loss of linguistic diversity.
The key insights are:
In the Fully Synthetic setting, where the model is trained solely on synthetic data, total collapse is unavoidable. The paper provides theoretical results characterizing the rate at which this collapse occurs, showing an exponential dependence on the number of generations and a polynomial dependence on the sample size.
In the Partially Synthetic setting, where the model is trained on a mixture of real and synthetic data, the paper provides an upper bound on the amount of synthetic data that can be incorporated without causing model collapse. Specifically, the amount of synthetic data should be logarithmic in the ratio of the real data size to the vocabulary size (see the first sketch after this list).
The paper also considers more realistic scenarios, such as training on a mixture of data from the most recent K generations or on data randomly sampled from all previous generations. The results show that while these settings delay the onset of collapse, it still eventually occurs (see the second sketch after this list).
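To make the contrast between the Fully and Partially Synthetic settings concrete, here is a hedged simulation sketch (an illustrative categorical setup, not the paper's experiments): each generation is fit on n_real fresh samples from the true distribution plus n_syn samples from the previous model, with the n_real = 0 row corresponding to the Fully Synthetic setting.

```python
# Illustrative sketch (assumed setup): Partially Synthetic recursion on a categorical
# distribution. Each generation's model is the empirical distribution of a corpus mixing
# n_real fresh samples from the true distribution with n_syn samples from the previous
# generation's model.
import numpy as np

rng = np.random.default_rng(1)
vocab_size = 50
p_true = np.full(vocab_size, 1.0 / vocab_size)  # the "real" data distribution

def run(n_real, n_syn, n_generations=500):
    p = p_true.copy()
    for _ in range(n_generations):
        counts = rng.multinomial(n_real, p_true) + rng.multinomial(n_syn, p)
        p = counts / (n_real + n_syn)
    return np.count_nonzero(p), float(np.abs(p - p_true).sum())

for n_real, n_syn in [(0, 200), (20, 180), (100, 100)]:
    support, l1 = run(n_real, n_syn)
    print(f"n_real={n_real:3d}, n_syn={n_syn:3d} -> support={support:2d}, "
          f"L1 distance to true distribution={l1:.3f}")
```

With no real data the support shrinks and the distance to the true distribution grows, while a sufficient share of fresh real data keeps the model anchored near the truth.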
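In the same spirit, a hedged sketch of the mixture-over-recent-generations variant (again an assumed illustrative setup): each new model is fit on synthetic data drawn from the K most recent models, approximated here by sampling from the average of their distributions. Larger K slows the loss of diversity relative to K = 1 but does not prevent the eventual drift.

```python
# Illustrative sketch (assumed setup): each generation is trained on synthetic data drawn
# from the K most recent models. K = 1 recovers the Fully Synthetic setting.
import numpy as np

rng = np.random.default_rng(2)
vocab_size, n_samples, n_generations = 50, 200, 500

def run(K):
    history = [np.full(vocab_size, 1.0 / vocab_size)]   # generation 0
    for _ in range(n_generations):
        pool = np.mean(history[-K:], axis=0)             # mixture of the last K models
        history.append(rng.multinomial(n_samples, pool) / n_samples)
    return np.count_nonzero(history[-1])

for K in (1, 5, 20):
    print(f"K = {K:2d}: support size after {n_generations} generations = {run(K)}")
```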
The theoretical analysis is supported by experiments on both the simple statistical model and more realistic transformer-based language models, confirming the key findings.
Statistics
The paper does not report specific numerical data or statistics; its analysis is based on theoretical derivations, supported by simulations and language-model experiments.
Quotes
The paper does not contain any direct quotes that are particularly striking or that support the key arguments.