Building a Hungarian Corpus for Extractive and Abstractive Summarization
This paper introduces HunSum-2, an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is constructed from the Common Crawl corpus, with thorough cleaning, preprocessing, and deduplication. The authors also generate sentence-level labels for extractive summarization using sentence similarity. Baseline models for both extractive and abstractive summarization are trained and evaluated on the dataset.
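The summary above does not say which similarity measure or selection rule is used for the sentence-level labels. As an illustration only, the idea can be sketched as follows: for each sentence of the reference summary, the most similar article sentence is marked as extractive-positive. The bag-of-words cosine similarity and the function names `bow`, `cosine`, and `extractive_labels` are assumptions made for this sketch, not the authors' method.

```python
import math
import re
from collections import Counter


def bow(sentence):
    # Lowercased bag-of-words counts. In Python 3 the word-character class
    # is Unicode-aware for str patterns, so accented Hungarian letters match.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extractive_labels(article_sentences, summary_sentences):
    # Hypothetical labeling rule (an assumption): each summary sentence
    # marks its single most similar article sentence with label 1.
    vectors = [bow(s) for s in article_sentences]
    labels = [0] * len(article_sentences)
    for summary_sentence in summary_sentences:
        target = bow(summary_sentence)
        scores = [cosine(v, target) for v in vectors]
        if scores and max(scores) > 0:
            labels[scores.index(max(scores))] = 1
    return labels


article = [
    "A kormány új törvényt fogadott el.",
    "Az időjárás holnap esős lesz.",
    "A törvény januárban lép hatályba.",
]
summary = ["Új törvényt fogadott el a kormány."]
labels = extractive_labels(article, summary)
print(labels)
```

In practice a dense sentence-embedding model would likely replace the bag-of-words vectors; the selection loop would stay the same.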