Core Concepts
A novel method for numerically evaluating text quality in large unlabelled NLP datasets, used to identify and eliminate low-quality text instances and thereby improve training efficiency for Language Models.
Abstract
The paper proposes a novel method for numerically evaluating text quality in large unlabelled NLP datasets. The approach combines 14 heuristic-based filters covering a wide range of linguistic characteristics to assign a "quality score" to each text instance in a model-agnostic manner. By leveraging this quality score, the authors demonstrate how low-quality text instances can be pruned from the dataset, enabling the training of Language Models (LMs) using only a fraction of the data while achieving comparable or even improved performance.
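The mechanism can be illustrated with a small sketch. The paper combines 14 heuristic filters; the three filters below (minimum length, alphabetic-character ratio, word repetition) are illustrative assumptions, not the paper's actual list, and the averaging and cutoff values are likewise hypothetical. The point is the shape of the approach: each filter inspects only the text itself, so the resulting score is model-agnostic and can be computed once and reused.

```python
# Hypothetical heuristic filters: each maps a text to 1.0 (pass) or
# 0.0 (fail). The paper uses 14 such filters; these three are
# illustrative stand-ins, not the authors' actual filter set.

def min_length(text: str, threshold: int = 20) -> float:
    """Pass if the text has at least `threshold` words."""
    return 1.0 if len(text.split()) >= threshold else 0.0

def alpha_ratio(text: str, threshold: float = 0.7) -> float:
    """Pass if most non-whitespace characters are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return 1.0 if sum(c.isalpha() for c in chars) / len(chars) >= threshold else 0.0

def low_repetition(text: str, threshold: float = 0.5) -> float:
    """Pass if the ratio of unique words to total words is high enough."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 1.0 if len(set(words)) / len(words) >= threshold else 0.0

FILTERS = [min_length, alpha_ratio, low_repetition]

def quality_score(text: str) -> float:
    """Model-agnostic quality score in [0, 1]: mean of the filter outputs."""
    return sum(f(text) for f in FILTERS) / len(FILTERS)

def prune(dataset: list[str], cutoff: float = 0.67) -> list[str]:
    """Keep only instances whose quality score clears the cutoff."""
    return [t for t in dataset if quality_score(t) >= cutoff]
```

Because the score depends only on the text, pruning is a one-time preprocessing step: the same scored dataset can feed multiple LM training runs without recomputation.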
The key highlights of the paper are:
The authors establish a framework to quantitatively evaluate text quality, addressing the lack of objective methods for assessing data quality for LM training.
The proposed text quality metric is model-agnostic, allowing it to be reused across different LMs without recomputation.
Experimental results on the OpenWebText and Wikipedia datasets show that by pruning the datasets based on text quality, the authors achieve substantial gains in training efficiency and effectiveness:
For OpenWebText, they observe an absolute accuracy improvement of 0.9% averaged over 14 downstream evaluation tasks while using 40% less data and training 42% faster.
For Wikipedia, they achieve a 0.8% average absolute accuracy improvement while using 20% less data and training 21% faster.
The authors also demonstrate that their approach can help remove potentially harmful content, since such content tends to score low on quality and is therefore pruned from the dataset.
The paper's contributions extend beyond immediate LM training, as the introduced text quality evaluation framework provides a foundation for further advancements in data curation, dataset selection, and the development of automated methods for text quality assessment.
Stats
The OpenWebText dataset contains 9.03 billion tokens before pruning or splitting into train and validation sets.
The Wikipedia dataset contains 4.67 billion tokens before pruning or splitting into train and validation sets.
Quotes
"By leveraging this numerical text quality score, we demonstrate how it can be used to prune the original dataset, enabling the training of LMs using only a fraction of the data."
"We observe an absolute improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LM models while using 40% lesser data and training 42% faster when training on the OpenWebText dataset and 0.8% average absolute accuracy improvement while using 20% lesser data and training 21% faster on the Wikipedia dataset."