Key Concepts
This paper presents a keyphrase generation model based on the Text-to-Text Transfer Transformer (T5) architecture, which can generate keyphrases that adequately define a document's content, including those not present in the original text. The authors also introduce a novel keyphrase filtering technique based on the T5 architecture to eliminate irrelevant keyphrases.
Summary
The paper focuses on the task of automatic keyphrase labelling, which involves retrieving words or short phrases that adequately describe a document's content. Previous work has explored extractive techniques to address this task, but these methods are limited to identifying keyphrases present in the input text. To overcome this limitation, the authors propose a keyphrase generation model based on the T5 architecture, named docT5keywords.
The authors explore two main approaches:
Keyphrase Generation:
- The docT5keywords model is fine-tuned on a text-to-text task, where the input is the document's title and abstract, and the output is the set of keyphrases.
- The authors evaluate two fine-tuning strategies: one using the t5-base model and another using the flan-t5-base model.
- They also investigate the impact of a majority voting inference approach, where multiple keyphrase sequences are generated, and the keyphrases are ranked based on their frequency of occurrence.
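The majority voting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the comma separator, the lowercase normalisation, and the alphabetical tie-breaking are all assumptions made for the example, and the sampled sequences are invented.

```python
from collections import Counter


def majority_vote(sequences, separator=",", top_k=None):
    """Rank keyphrases by how many generated sequences contain them.

    Hypothetical helper: the separator and normalisation are assumptions,
    not details taken from the paper.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()  # count each keyphrase at most once per sequence
        for phrase in seq.split(separator):
            norm = " ".join(phrase.lower().split())
            if norm and norm not in seen:
                seen.add(norm)
                counts[norm] += 1
    # Most frequent first; ties broken alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k] if top_k is not None else ranked


# Invented outputs standing in for several sampled generations
samples = [
    "keyphrase generation, transformers, text summarization",
    "keyphrase generation, T5, transformers",
    "Keyphrase Generation, information retrieval",
]
ranked = majority_vote(samples)
for phrase, votes in ranked:
    print(f"{votes}  {phrase}")
```

Counting a keyphrase once per sequence keeps a single repetitive generation from dominating the ranking; whether the original work does the same is not stated in this summary.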
Keyphrase Filtering:
- The authors introduce a novel keyphrase filtering model, keyFilT5r, which is also based on the T5 architecture.
- The filtering model is fine-tuned to learn whether a given keyphrase is relevant to a document, using both soft (based on keyphrase co-occurrence) and hard (based on generated keyphrases) negative examples.
- The filtering model is evaluated in two ways: (1) a binary evaluation that measures its accuracy in identifying relevant keyphrases, and (2) a downstream evaluation that filters the keyphrases predicted by other models and checks whether their evaluation scores improve.
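The second evaluation mode, filtering another model's predictions and comparing scores before and after, can be illustrated with a small sketch. Everything here is hypothetical: `toy_filter` is a hard-coded stand-in for the filtering model's yes/no decision, the keyphrases are invented, and exact-match F1 is assumed as the metric for illustration only.

```python
def f1_score(predicted, gold):
    """Exact-match F1 between predicted and gold keyphrase sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def apply_filter(predicted, is_relevant):
    """Keep only the keyphrases the relevance predicate accepts."""
    return [kp for kp in predicted if is_relevant(kp)]


# Invented gold labels and predictions from some upstream model
gold = ["keyphrase generation", "t5", "transformers"]
predicted = ["keyphrase generation", "t5", "image segmentation", "protein folding"]

# Hypothetical stand-in for the filtering model's relevance decision
rejected = {"image segmentation", "protein folding"}


def toy_filter(keyphrase):
    return keyphrase not in rejected


before = f1_score(predicted, gold)
after = f1_score(apply_filter(predicted, toy_filter), gold)
print(f"F1 before filtering: {before:.3f}")
print(f"F1 after filtering:  {after:.3f}")
```

Filtering can only remove predictions, so it raises precision when it drops false positives and leaves recall unchanged as long as no correct keyphrase is rejected.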
The experimental results demonstrate that the docT5keywords model significantly outperforms various baselines, including supervised and unsupervised keyphrase extraction and generation models. The proposed keyphrase filtering technique also achieves near-perfect accuracy in eliminating false positives across all datasets.
Statistics
The paper uses several datasets for evaluation, including Inspec, KP20k, KP-BioMed, MAG, and KPTimes. The datasets vary in size, domain, and the type of keyphrase annotation (e.g., author-provided, semi-automatic).
Quotes
"Automatic keyphrase labelling stands for the ability of models to retrieve words or short phrases that adequately describe documents' content."
"Given this limitation, keyphrase generation approaches have arisen lately."
"One indicator of the reliance on keyphrases in academic search and recommendation is that publishers often ask authors to label their publications with keyphrases manually."