The researchers present a case study that extends LLM-based data annotation to improving the quality of the existing Multi-News dataset, a widely used benchmark for multi-document summarization. The Multi-News dataset was constructed by crawling news articles from the internet, a process that often pulled noisy and irrelevant documents into the dataset.
To address this issue, the researchers designed a framework that uses LLMs to analyze each summary and its associated documents, identifying and excluding documents that are not relevant to the summary. Specifically, they employed chain-of-thought (CoT) prompting so that the model states the rationale behind each decision, enhancing transparency and facilitating human inspection. They further strengthened the cleansing process with self-consistency: multiple sampled judgments are aggregated by majority vote, mimicking the majority voting process used by human annotators.
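A minimal sketch of such a cleansing step is shown below, assuming an OpenAI-style chat API; the model name, prompt wording, and vote count are illustrative choices, not the authors' exact configuration.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative CoT prompt: the model reasons first, then gives a verdict.
COT_PROMPT = """You are given a summary and one candidate source document.
First, reason step by step about whether the document contains information
that supports the summary. Then answer on the final line with exactly
"Relevant" or "Irrelevant".

Summary:
{summary}

Document:
{document}
"""

def judge_once(summary: str, document: str, model: str = "gpt-3.5-turbo") -> str:
    """Single chain-of-thought judgment: the rationale precedes the verdict."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": COT_PROMPT.format(summary=summary, document=document)}],
        temperature=1.0,  # sampling diversity is what makes self-consistency useful
    )
    text = response.choices[0].message.content.strip()
    # The verdict is expected on the last line, after the rationale.
    verdict = text.splitlines()[-1].strip().lower()
    return "Relevant" if verdict.startswith("relevant") else "Irrelevant"

def judge_with_self_consistency(summary: str, document: str, n_votes: int = 5) -> str:
    """Sample several CoT judgments and keep the majority vote,
    mimicking majority voting among human annotators."""
    votes = Counter(judge_once(summary, document) for _ in range(n_votes))
    return votes.most_common(1)[0][0]
```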
Based on this framework, the researchers introduced MULTI-NEWS+, an enhanced version of the Multi-News dataset, by removing the identified noisy documents. Experiments demonstrated that models trained on MULTI-NEWS+ outperformed those trained on the original Multi-News dataset, indicating the improved quality of the dataset. The researchers made MULTI-NEWS+ and the source code publicly available for further study.
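As an illustration of the dataset-construction step, the sketch below applies the self-consistency judge from the previous example to drop irrelevant documents from each instance. It assumes the Multi-News convention of "|||||"-separated source documents and the Hugging Face `multi_news` field names (`document`, `summary`); it is not the authors' released pipeline.

```python
from datasets import load_dataset

def cleanse_example(example: dict) -> dict:
    """Keep only documents the LLM judges relevant to the reference summary."""
    documents = [d.strip() for d in example["document"].split("|||||") if d.strip()]
    kept = [
        doc for doc in documents
        if judge_with_self_consistency(example["summary"], doc) == "Relevant"
    ]
    return {"document": " ||||| ".join(kept), "summary": example["summary"]}

# Illustrative usage on a small slice, since each example costs several LLM calls.
train = load_dataset("multi_news", split="train[:10]")
cleaned = train.map(cleanse_example)
```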
This work showcases the potential of leveraging LLMs for cost-efficient dataset cleansing, which can be extended to other datasets across various domains, contributing to the advancement of natural language processing research.
Key insights extracted from the paper by Juhwan Choi et al., arxiv.org, 04-16-2024: https://arxiv.org/pdf/2404.09682.pdf