The authors present "TeClass", a novel dataset for relevance-based headline classification in the Telugu language. The dataset contains 26,178 article-headline pairs, annotated by human annotators into three categories: Highly Related (HREL), Moderately Related (MREL), and Least Related (LREL).
The authors run experiments with a range of baseline models, from traditional machine learning approaches to state-of-the-art BERT-based models. The BERT-based models, particularly mDeBERTa, outperform the classical machine learning models, reaching a weighted F1 score of 0.63 and a macro F1 score of 0.64.
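To make the two reported metrics concrete, the following is a minimal sketch of how macro and weighted F1 differ for a three-class task with the HREL, MREL, and LREL labels. The toy labels in `y_true` and `y_pred` are invented for illustration and are not drawn from TeClass; the function names are hypothetical and this is not the authors' evaluation code.

```python
from collections import Counter

LABELS = ["HREL", "MREL", "LREL"]

# Hypothetical toy annotations and predictions, NOT data from TeClass.
y_true = ["HREL", "HREL", "HREL", "HREL", "MREL", "MREL", "LREL", "LREL"]
y_pred = ["HREL", "HREL", "HREL", "MREL", "MREL", "LREL", "LREL", "LREL"]


def per_class_f1(true, pred, label):
    """F1 for one label, treating that label as the positive class."""
    pairs = list(zip(true, pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(true, pred, labels=LABELS):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(per_class_f1(true, pred, l) for l in labels) / len(labels)


def weighted_f1(true, pred, labels=LABELS):
    """Mean of per-class F1 weighted by each class's share of true labels."""
    support = Counter(true)
    total = len(true)
    return sum(per_class_f1(true, pred, l) * support[l] / total for l in labels)


if __name__ == "__main__":
    print("macro F1:   ", round(macro_f1(y_true, y_pred), 4))
    print("weighted F1:", round(weighted_f1(y_true, y_pred), 4))
```

Macro F1 treats the three relevance classes equally, while weighted F1 scales each class by its frequency, so the two scores diverge when the classes are imbalanced and per-class performance differs.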
Furthermore, the authors demonstrate the impact of the TeClass dataset on improving headline generation models. They fine-tune an mT5 model on different subsets of the dataset and observe a significant improvement in ROUGE-L scores (around 5 points) when the model is trained on highly relevant article-headline pairs compared to the non-fine-tuned model.
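For readers unfamiliar with the metric, ROUGE-L scores a generated headline by the longest common subsequence (LCS) of tokens it shares with the reference headline. The sketch below assumes whitespace tokenization and a plain F1 combination of LCS precision and recall; the authors' exact scorer, tokenizer, and weighting are not described here and may differ. The function names and English example sentences are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as the F1 of LCS-based precision and recall (0 to 1 scale)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(round(rouge_l_f1(reference, candidate), 4))
```

ROUGE scores are commonly reported multiplied by 100, in which case a gain of around 5 points corresponds to roughly 0.05 on the 0 to 1 scale used in this sketch.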
The authors emphasize the importance of high-quality, relevance-based data for headline generation tasks, as the presence of irrelevant headlines can negatively impact the performance of such models. The TeClass dataset and the annotation guidelines are made publicly available to encourage future research in this area.
by Gopichand Ka... at arxiv.org, 04-18-2024
https://arxiv.org/pdf/2404.11349.pdf