Core Concepts
This work identifies trends in the new datasets introduced in 2022 at two top NLP conferences, ACL and EMNLP.
Abstract
This study focuses on analyzing the datasets introduced at the ACL and EMNLP conferences in 2022, which are recognized as leading venues for natural language processing research. The key insights are:
Coverage of NLP Tasks: The datasets cover a wide range of NLP tasks. The most common are text generation, text summarization, text/token classification, information extraction, natural language understanding, and question answering.
Dataset Size: Dataset sizes vary significantly, with most containing between 10,000 and 50,000 instances. A few very large datasets exceed 1 million instances.
Collaboration in Dataset Construction: The authors come from a mix of academic and industry affiliations, which suggests that dataset construction benefits from collaboration between the two sectors. Prominent contributors include academic institutions such as Tsinghua University, the University of Washington, and the Singapore University of Technology and Design, and industry labs such as Microsoft Research, Google Research, and Huawei Noah's Ark Lab.
Multimodality: There is a growing trend towards multimodal datasets that combine text with other modalities like images and videos. These datasets enable the development of visual-language systems for tasks such as multimodal dialogue summarization, visual question answering, and visual storytelling.
Multilingualism: While most datasets are in English, there is also a notable number of non-English and multilingual datasets, covering languages like Chinese, French, German, Spanish, and various Indic languages.
Overall, this analysis characterizes the current state of dataset curation in natural language processing and points to directions it may take.
Stats
"The two top NLP conferences, ACL and EMNLP, accepted ninety-two papers in 2022, introducing new datasets."
"Most datasets have data samples in the range of 10,000 to 50,000."
"There are also a few very large datasets with over 1 million instances."
Quotes
"The need to have quality datasets has prompted NLP researchers to continue creating new datasets to satisfy particular needs."
"The industrial labs provide the practical use cases, large-scale data, computing resources, and funds to foot the costs necessary to build a new dataset. In contrast, academia provides theoretical insights, novel methodologies, and expertise in meticulous experimental design."
"There is a growing demand for NLP systems that accept at least two modalities as input, for example, text and image, leading to a further increase in demand for multimodal datasets."