The KoCoNovel dataset is derived from 50 modern and contemporary Korean novels and comprises 178,957 tokens and 19,030 character mentions. It is the first Korean coreference resolution dataset based on literary texts, and it addresses the challenges posed by the nuances of Korean address terms and the lack of grammatical markers for proper nouns.
The key highlights of the KoCoNovel dataset include:
Revised annotation guidelines to accommodate the culture of address terms in Korea, where characters are frequently referred to by common nouns denoting social relationships and kinship rather than personal names.
Four distinct versions of the dataset, offering annotations from the perspectives of the omniscient author and the readers, as well as options for treating multiple entities as either separate or overlapping.
Detailed analysis revealing that 24% of all character mentions in KoCoNovel are single common nouns, highlighting the unique linguistic characteristics of the Korean language.
Experiments with BERT-based coreference models showing that training on KoCoNovel notably improves character coreference performance compared with training solely on the non-literary NIKL corpus, underscoring the dataset's potential to enhance coreference resolution in Korean.
The KoCoNovel dataset provides a comprehensive resource for exploring character coreference resolution in Korean literature, integrating the cultural and linguistic dynamics of the Korean language.
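The figures quoted above can be combined into a quick sanity check. The following is a minimal sketch, not part of the dataset's tooling: the version labels are hypothetical names chosen for illustration, and the count of single-common-noun mentions is only an estimate because the 24% share is a rounded figure.

```python
from itertools import product

# Figures quoted in the summary above.
TOTAL_TOKENS = 178_957
TOTAL_MENTIONS = 19_030
SINGLE_COMMON_NOUN_SHARE = 0.24  # "24% of all character mentions" (rounded)

# Hypothetical labels for the two annotation choices; the dataset's own
# names for these options may differ.
PERSPECTIVES = ("omniscient_author", "reader")
MULTI_ENTITY_MODES = ("separate", "overlapping")


def dataset_versions():
    """Enumerate the four versions as perspective x multi-entity handling."""
    return [f"{p}/{m}" for p, m in product(PERSPECTIVES, MULTI_ENTITY_MODES)]


def approx_single_common_noun_mentions():
    """Estimate how many mentions are single common nouns."""
    return round(TOTAL_MENTIONS * SINGLE_COMMON_NOUN_SHARE)


def tokens_per_mention():
    """Average number of tokens per annotated character mention."""
    return TOTAL_TOKENS / TOTAL_MENTIONS


if __name__ == "__main__":
    for version in dataset_versions():
        print(version)
    print(approx_single_common_noun_mentions())
    print(f"{tokens_per_mention():.1f}")
```

Under these assumptions, the two binary annotation choices account for the four versions, and roughly one character mention occurs for every nine to ten tokens of text.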
Key Insights Distilled From
by Kyuhee Kim,S... at arxiv.org 04-02-2024
https://arxiv.org/pdf/2404.01140.pdf
Deeper Inquiries