Core Concepts
A novel method for synthesizing WikiOFGraph, a large-scale, high-quality knowledge graph-to-text generation dataset that is independent of external ontologies, leveraging Large Language Models and Data-QuestEval for effective graph extraction and data curation.
Abstract
The paper introduces a novel method for synthesizing a large-scale, general-domain knowledge graph-to-text (G2T) generation dataset called WikiOFGraph. The key highlights are:
- Limitations of existing G2T datasets:
  - Scarcity of high-quality, general-domain G2T datasets
  - Ontology-based datasets suffer from graph-text misalignment
- Proposed method:
  - Leverages Large Language Models (LLMs) to extract graph representations directly from Wikipedia sentences
  - Utilizes Data-QuestEval, a referenceless evaluation framework, to curate well-aligned graph-text pairs
  - Generates 5.85M high-quality, ontology-free G2T samples covering a broad range of Wikipedia topics
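The two-stage pipeline above can be sketched in miniature. This is an illustrative sketch only, not the authors' code: `extract_triples` is a hypothetical stand-in for the LLM extraction step, and `questeval_score` is a hypothetical stand-in for Data-QuestEval's referenceless consistency metric.

```python
def extract_triples(sentence):
    """Stand-in for the LLM extraction step: sentence -> list of
    (subject, predicate, object) triplets. A real system would
    prompt an LLM here; this stub handles one toy pattern."""
    if " was born in " in sentence:
        subj, obj = sentence.split(" was born in ")
        return [(subj, "birthPlace", obj.rstrip("."))]
    return []

def questeval_score(triples, sentence):
    """Stand-in for Data-QuestEval: returns a score in [0, 1] for
    graph-text consistency without any reference text. Crude proxy:
    fraction of triple entities that appear in the sentence."""
    entities = [e for (s, _, o) in triples for e in (s, o)]
    if not entities:
        return 0.0
    return sum(e in sentence for e in entities) / len(entities)

def curate(sentences, threshold=0.5):
    """Keep only graph-text pairs whose consistency score clears the
    threshold, mirroring the paper's curation stage."""
    dataset = []
    for sent in sentences:
        triples = extract_triples(sent)
        if triples and questeval_score(triples, sent) >= threshold:
            dataset.append({"graph": triples, "text": sent})
    return dataset

pairs = curate(["Ada Lovelace was born in London."])
```

The key design point the sketch mirrors is that curation needs no gold reference: the score compares the extracted graph directly against the source sentence, so low-alignment pairs can be dropped at scale.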
- Comprehensive analysis:
  - WikiOFGraph exhibits high graph-text consistency, comparable to a human-crafted dataset
  - Significantly outperforms existing ontology-based datasets in domain diversity and scale
  - Fine-tuning a PLM on WikiOFGraph leads to superior performance on general-domain G2T generation tasks
- Additional experiments and case study:
  - Demonstrate the effectiveness of Data-QuestEval in ensuring high-quality graph-text alignment
  - Identify and address common issues in LLM-generated graph-text pairs, such as incomplete sentences and ambiguous pronouns
The proposed method provides a scalable and effective solution for generating high-quality G2T data without relying on proprietary LLMs, external ontologies, or extensive human involvement, making it a valuable contribution to advancing G2T generation research.
Stats
The average number of triplets per sample in WikiOFGraph is 3.62.
WikiOFGraph contains 140,733 unique predicates and 8.2M unique entities.
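To make the "triplets per sample" statistic concrete, here is a toy illustration of the sample format (a set of triplets paired with its text) and how the average would be computed. The sample contents are invented for illustration and are not drawn from WikiOFGraph.

```python
# Hypothetical G2T samples: each pairs a small graph (triplet list)
# with the sentence it describes.
samples = [
    {"graph": [("Paris", "capitalOf", "France")],
     "text": "Paris is the capital of France."},
    {"graph": [("Alan Turing", "field", "computer science"),
               ("Alan Turing", "birthYear", "1912")],
     "text": "Alan Turing, born in 1912, worked in computer science."},
]

# Average triplets per sample over the toy set (WikiOFGraph reports 3.62).
avg_triplets = sum(len(s["graph"]) for s in samples) / len(samples)
```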
Quotes
"Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment."
"To address this issue, we introduce an effective method for generating high-quality G2T dataset that integrates LLM with Data-QuestEval (Rebuffel et al., 2021)."
"Experimental results demonstrate that PLM fine-tuned on WikiOFGraph outperforms those trained on other datasets across various evaluation metrics."