Core Concepts
The work develops a high-quality, human-annotated dataset for relevance-based classification of Telugu news headlines and demonstrates that it improves headline generation models.
Abstract
The authors present "TeClass", a novel dataset for relevance-based headline classification in the Telugu language. The dataset contains 26,178 article-headline pairs, annotated by human annotators into three categories: Highly Related (HREL), Moderately Related (MREL), and Least Related (LREL).
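As a rough illustration of the structure described above, the sketch below models an article-headline pair with one of the three relevance labels and selects a subset by label. The class name `HeadlinePair`, its field names, and the helper functions are hypothetical and chosen for this example only; the released dataset's actual file format and column names are not given here.

```python
from collections import Counter
from dataclasses import dataclass

# The three relevance categories named in the abstract.
LABELS = ("HREL", "MREL", "LREL")


@dataclass(frozen=True)
class HeadlinePair:
    """One annotated example (hypothetical field names)."""
    article: str
    headline: str
    label: str  # expected to be one of LABELS


def select_by_label(pairs, wanted):
    """Return only the pairs whose label is in `wanted`."""
    wanted = set(wanted)
    unknown = wanted - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [p for p in pairs if p.label in wanted]


def label_distribution(pairs):
    """Count how many pairs carry each label."""
    return Counter(p.label for p in pairs)
```

A subset such as `select_by_label(pairs, ["HREL"])` is the kind of filtered split that could then be passed to a headline generation model for fine-tuning.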
The authors conduct comprehensive experiments using various baseline models, including traditional machine learning approaches and state-of-the-art BERT-based models. The results show that the BERT-based models, particularly mDeBERTa, outperform the classical machine learning models, achieving a weighted F1 score of 0.63 and a macro F1 score of 0.64.
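The two F1 variants reported above differ only in how per-class scores are averaged: macro F1 gives every class equal weight, while weighted F1 weights each class by its number of true examples. The following is a minimal pure-Python sketch of both, using the standard definitions; it is not the authors' evaluation code, and the toy labels in the usage note are invented.

```python
from collections import Counter

LABELS = ("HREL", "MREL", "LREL")


def per_class_f1(y_true, y_pred, label):
    """F1 for a single class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)


def weighted_f1(y_true, y_pred, labels=LABELS):
    """Mean of the per-class F1 scores, weighted by class support."""
    if not y_true:
        return 0.0
    support = Counter(y_true)
    return sum(per_class_f1(y_true, y_pred, l) * support[l] for l in labels) / len(y_true)
```

For example, with true labels `["HREL"] * 4 + ["MREL", "LREL"]` and a model that predicts the MREL item as LREL, the frequent HREL class pulls the weighted score above the macro score, which is why the two numbers are usually reported together on imbalanced data.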
Furthermore, the authors demonstrate the impact of the TeClass dataset on improving headline generation models. They fine-tune an mT5 model on different subsets of the dataset and observe an improvement of around 5 points in ROUGE-L when the model is trained on highly relevant article-headline pairs, compared to the non-fine-tuned model.
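ROUGE-L scores a generated headline by the longest common subsequence (LCS) of tokens it shares with the reference headline. The sketch below computes an LCS-based F1 under two assumptions that are not taken from the paper: tokens are split on whitespace, and precision and recall are combined with equal weight. The result is on a 0 to 1 scale, so a gain of about 5 points corresponds to roughly 0.05 here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """LCS-based F1 between a reference and a candidate headline."""
    ref = reference.split()   # assumption: whitespace tokenization
    cand = candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because headlines average only about 6 tokens, a single mismatched token shifts this score noticeably, so small corpus-level differences are best read as averages over many pairs.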
The authors emphasize the importance of high-quality, relevance-based data for headline generation tasks, as the presence of irrelevant headlines can negatively impact the performance of such models. The TeClass dataset and the annotation guidelines are made publicly available to encourage future research in this area.
Stats
The average number of sentences in the articles is around 10.
The average number of tokens in the articles is around 126.
The average number of tokens in the headlines is around 6.
Quotes
"Relevance-based headline classification can greatly aid the task of generating relevant headlines."
"The headlines generated by the models fine-tuned on highly relevant article-headline pairs, showed about a 5 point increment in the ROUGE-L scores."