The authors present "TeClass", a novel dataset for relevance-based headline classification in the Telugu language. The dataset contains 26,178 article-headline pairs, each labeled by human annotators with one of three categories: Highly Related (HREL), Moderately Related (MREL), or Least Related (LREL).
The authors run experiments with a range of baseline models, including traditional machine learning approaches and state-of-the-art BERT-based models. The BERT-based models, particularly mDeBERTa, outperform the classical machine learning models, achieving a weighted F1 score of 0.63 and a macro F1 score of 0.64.
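The two F1 variants reported above differ only in how per-class scores are averaged: macro F1 gives each of the three classes equal weight, while weighted F1 weights each class by its number of true examples. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function names and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The three relevance classes named in the summary.
LABELS = ["HREL", "MREL", "LREL"]


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)


def weighted_f1(y_true, y_pred, labels=LABELS):
    """Mean of per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, l) * support[l] / total for l in labels
    )


# Toy example (made-up labels, not drawn from TeClass).
y_true = ["HREL", "HREL", "HREL", "HREL", "MREL", "MREL", "LREL", "LREL"]
y_pred = ["HREL", "HREL", "HREL", "MREL", "MREL", "LREL", "LREL", "LREL"]
print(f"macro F1:    {macro_f1(y_true, y_pred):.4f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

When the classes are imbalanced, the two averages can diverge noticeably, which is why reporting both gives a fuller picture of classifier quality.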
Furthermore, the authors demonstrate the impact of the TeClass dataset on headline generation models. They fine-tune an mT5 model on different subsets of the dataset and observe a gain of around 5 ROUGE-L points when the model is trained on highly relevant article-headline pairs, compared to the non-fine-tuned model.
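ROUGE-L scores a generated headline by the longest common subsequence (LCS) it shares with the reference headline. The sketch below is a simplified illustration, not the scoring setup used in the paper: it assumes whitespace tokenization and a balanced F-measure (equal weight on precision and recall), and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 over whitespace tokens (an assumption)."""
    ref = reference.split()
    cand = candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Made-up English example; the paper's data is in Telugu.
score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-L F1: {score:.4f}")
```

Scores from this sketch lie between 0 and 1, so a reported gain of about 5 points corresponds to roughly 0.05 on this scale.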
The authors emphasize the importance of high-quality, relevance-based data for headline generation, since irrelevant headlines in the training data can degrade the performance of such models. The TeClass dataset and the annotation guidelines are made publicly available to encourage future research in this area.
by Gopichand Ka..., published on arxiv.org, 04-18-2024
https://arxiv.org/pdf/2404.11349.pdf