Key concepts
DiLM trains a language model to generate informative synthetic text samples that can be used to train different types of models, independent of their word embedding weights.
Summary
The paper proposes a novel text dataset distillation approach called Distilling dataset into Language Model (DiLM), which addresses the discreteness of text by using a language model as a surrogate optimization target instead of directly optimizing synthetic text samples.
Key highlights:
- DiLM trains a language model to generate synthetic training samples that are more informative than the real samples in the original dataset, by minimizing a gradient matching loss between learner-model gradients computed on the generated samples and on the real samples.
- To back-propagate the gradient matching loss to the language model, DiLM designs a differentiable backward pass that weights the learner's loss on each generated sample by its generation probability, bypassing the non-differentiable generated text (see the sketch after this list).
- DiLM outperforms current coreset selection methods not only for training the same model used for distillation, but also for training different models independent of their word embedding weights, architectures, and training processes.
- DiLM's distilled synthetic datasets also achieve remarkable generalization performance for in-context learning of large language models.
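To make the loss-weighting trick concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes a Hugging Face-style causal LM as `generator`, a small `nn.Module` classifier `learner` that maps token ids to class logits, and hypothetical `prompt_ids`/`syn_labels` for class-conditional prompts and their labels; the softmax normalization of sequence log-probabilities into loss weights is also an assumption about the weighting scheme. The sampled token ids are detached, so the gradient matching loss reaches the generator only through the differentiable generation probabilities used as per-sample loss weights.

```python
import torch
import torch.nn.functional as F


def gradient_matching_loss(grads_syn, grads_real):
    """1 - cosine similarity between corresponding learner gradients."""
    return sum(1.0 - F.cosine_similarity(gs.flatten(), gr.flatten(), dim=0)
               for gs, gr in zip(grads_syn, grads_real))


def dilm_loss(generator, learner, prompt_ids, syn_labels, real_ids, real_labels):
    params = tuple(learner.parameters())

    # 1) Sample synthetic texts: discrete token ids, no gradient flows through them.
    with torch.no_grad():
        syn_ids = generator.generate(prompt_ids, do_sample=True, max_new_tokens=64)

    # 2) Re-score the samples to obtain differentiable sequence log-probabilities
    #    (prompt/padding handling omitted for brevity).
    logits = generator(syn_ids).logits[:, :-1]
    token_logp = -F.cross_entropy(logits.transpose(1, 2), syn_ids[:, 1:],
                                  reduction="none")
    weights = torch.softmax(token_logp.sum(-1), dim=0)  # per-sample loss weights

    # 3) Learner loss on synthetic samples, weighted by generation probabilities;
    #    create_graph=True keeps the weights in the graph of the resulting gradients.
    syn_loss = (weights * F.cross_entropy(learner(syn_ids), syn_labels,
                                          reduction="none")).sum()
    grads_syn = torch.autograd.grad(syn_loss, params, create_graph=True)

    # 4) Learner gradients on a real mini-batch, treated as constants.
    real_loss = F.cross_entropy(learner(real_ids), real_labels)
    grads_real = torch.autograd.grad(real_loss, params)

    # 5) Matching loss: back-propagating it updates the generator only via `weights`.
    return gradient_matching_loss(grads_syn, grads_real)
```

Because the generated token ids are sampled under `torch.no_grad()`, the only path from the matching loss back to the generator's parameters is through the re-scored generation probabilities, which is exactly the bypass described above.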
Statistics
The original training datasets used in the experiments are:
- SST-2: 67.3k samples, 2 classes
- QQP: 364k samples, 2 classes
- MNLI-m: 393k samples, 3 classes
Quotes
"To the best of our knowledge, this is the first study to distill a text dataset into a text-level synthetic dataset that are applicable for training models independent of word embedding weights."
"We present DiLM, which addresses the discreteness of text by using a language model as a surrogate optimization target and back-propagating the distillation loss to the model, bypassing non-differentiable generated text."
"Our experimental results indicate that DiLM outperformed the current coreset selection methods not only for training the same model used for distillation, but also for training different models independent of the word embedding weights, architectures, and training processes."