The researchers present a case study that extends the application of LLM-based data annotation to enhance the quality of the existing Multi-News dataset, a widely used dataset for multi-document summarization. The Multi-News dataset was constructed by crawling news articles from the internet, which often resulted in the inclusion of noisy and irrelevant documents.
To address this issue, the researchers designed a framework that uses LLMs to analyze each summary and its associated documents, identifying and excluding documents that are not relevant to the summary. Specifically, they employed chain-of-thought (CoT) prompting to elicit a rationale for each decision, enhancing transparency and facilitating human inspection. They further improved the cleansing process by incorporating self-consistency, mimicking the majority-voting process used by human annotators.
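The cleansing procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `query_llm` is a hypothetical stand-in for a real LLM API call (faked here with a word-overlap heuristic so the sketch runs), and the prompt design, sampling parameters, and voting threshold are assumptions.

```python
from collections import Counter

def query_llm(summary: str, document: str, seed: int) -> str:
    """Hypothetical stand-in for an LLM call that, via a CoT prompt,
    returns a relevance verdict for one document. A real implementation
    would send summary + document to an LLM and parse its answer; here
    we fake the verdict with a simple word-overlap heuristic."""
    overlap = set(summary.lower().split()) & set(document.lower().split())
    return "relevant" if len(overlap) >= 3 else "irrelevant"

def classify_document(summary: str, document: str, n_samples: int = 5) -> str:
    """Self-consistency: sample several independent CoT verdicts and
    take the majority vote, mimicking majority voting among human
    annotators."""
    votes = [query_llm(summary, document, seed=i) for i in range(n_samples)]
    label, _ = Counter(votes).most_common(1)[0]
    return label

def cleanse(summary: str, documents: list[str]) -> list[str]:
    """Keep only the documents judged relevant to the summary."""
    return [d for d in documents if classify_document(summary, d) == "relevant"]
```

In practice each sampled verdict would come from a separate LLM call with nonzero temperature, so the votes can genuinely disagree; the majority vote then smooths out individual sampling errors.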
Based on this framework, the researchers introduced MULTI-NEWS+, an enhanced version of the Multi-News dataset, by removing the identified noisy documents. Experiments demonstrated that models trained on MULTI-NEWS+ outperformed those trained on the original Multi-News dataset, indicating the improved quality of the dataset. The researchers made MULTI-NEWS+ and the source code publicly available for further study.
This work showcases the potential of leveraging LLMs for cost-efficient dataset cleansing, which can be extended to other datasets across various domains, contributing to the advancement of natural language processing research.
Key ideas extracted from the source content, by Juhwan Choi et al., arxiv.org, 04-16-2024
https://arxiv.org/pdf/2404.09682.pdf