Core Concepts
This survey offers a consolidated view of the neural data-to-text (D2T) paradigm, examining the latest approaches, benchmark datasets, and evaluation protocols. It highlights promising avenues for D2T research that focus on linguistic capabilities as well as fairness and accountability.
Abstract
This survey provides a structured examination of innovations in neural data-to-text (D2T) generation over the past half-decade. It covers the following key aspects:
Datasets for D2T Generation:
Meaning Representations (MRs): Datasets like RoboCup, WeatherGov, BAGEL, and E2E that align data instances with their corresponding narratives.
Graph Representations: Datasets like WebNLG, AGENDA, and DART that use graph-based data instances.
Tabular Representations: Datasets like WikiBio, RotoWire, TabFact, and ToTTo that use tabular data instances.
Data Collection and Enrichment: Approaches to efficiently acquire and enrich D2T datasets.
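To make the meaning-representation setting concrete, here is a minimal sketch of what an E2E-style MR-text pair looks like, together with a naive template realizer. The slot names follow the E2E convention, but the restaurant values and the `realize` helper are invented for illustration; real D2T systems learn this mapping rather than hand-coding it.

```python
# An E2E-style MR: a flat set of attribute-value slots describing one entity,
# paired with a human-written reference narrative.
mr = {
    "name": "The Eagle",       # hypothetical values, E2E-style slot names
    "eatType": "coffee shop",
    "food": "French",
    "area": "riverside",
}

reference = "The Eagle is a French coffee shop by the riverside."

def realize(mr):
    """Naive hand-written template realization, for illustration only."""
    return f'{mr["name"]} is a {mr["food"]} {mr["eatType"]} by the {mr["area"]}.'

print(realize(mr))
```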
Data-to-Text Generation Fundamentals:
Content Selection: Determining what information from the data instance to include in the narrative.
Surface Realization: Constructing the narrative structure using words, phrases, and paragraphs.
Hallucinations and Omissions: Balancing coherence, linguistic diversity, and data fidelity.
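The classical split between content selection and surface realization can be sketched as a two-stage pipeline over a toy RotoWire-style record set. The record schema and the salience rule below are invented for illustration; neural systems learn both stages jointly or separately from data.

```python
# Toy table of game records (RotoWire-style entity/type/value triples).
records = [
    {"entity": "Lakers", "type": "TEAM-PTS", "value": 110},
    {"entity": "Celtics", "type": "TEAM-PTS", "value": 102},
    {"entity": "Lakers", "type": "TEAM-REB", "value": 45},
]

def select_content(records):
    # Content selection: keep only records judged salient;
    # here, a hand-coded rule keeping team point totals.
    return [r for r in records if r["type"] == "TEAM-PTS"]

def realize(selected):
    # Surface realization: order teams by score and verbalize
    # the result as a single sentence.
    a, b = sorted(selected, key=lambda r: -r["value"])
    return f'The {a["entity"]} beat the {b["entity"]} {a["value"]}-{b["value"]}.'

print(realize(select_content(records)))
```

Omitting the rebound record here is deliberate selection; a system that instead invented a statistic not in the table would be hallucinating, which is the fidelity trade-off the survey discusses.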
Innovations in the Seq2Seq Framework:
Supervised Learning: Architectural and loss-function interventions to improve data fidelity, including entity encoders, hierarchical encoders, plan encoders, graph encoders, reconstruction and hierarchical decoders, regularization techniques, and reinforcement learning.
Unsupervised Learning: Approaches like variational autoencoders, adversarial training, and iterative text editing to induce templates and improve generation.
Data Preprocessing:
Delexicalization and Noise Reduction: Techniques to handle data sparsity and semantic noise.
Linearization: Effective representations of graph-structured data for seq2seq models.
Data Augmentation: Approaches to enrich the training data, including permutation, adversarial examples, and few-shot learning.
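Two of the preprocessing steps above can be sketched in a few lines. The `<H>/<R>/<T>` delimiter tokens in the linearization follow one common convention for WebNLG-style triples (other schemes exist), and the delexicalization helper is a simplified illustration assuming slot values appear verbatim in the text.

```python
# Linearization: flatten a set of RDF triples into a token sequence
# that a standard seq2seq encoder can consume.
triples = [
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
]

def linearize(triples):
    parts = []
    for h, r, t in triples:
        parts += ["<H>", h, "<R>", r, "<T>", t]
    return " ".join(parts)

# Delexicalization: replace slot values with placeholders so the model
# sees one template instead of many sparse lexical variants.
def delexicalize(text, mr):
    for slot, value in mr.items():
        text = text.replace(value, f"<{slot.upper()}>")
    return text

print(linearize(triples))
print(delexicalize("The Eagle serves French food.",
                   {"name": "The Eagle", "food": "French"}))
```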
Evaluation Protocols:
Automated Metrics: Word-overlap, extractive, and semantic metrics to assess D2T systems.
Human Evaluation: Protocols to measure coherence, fluency, and faithfulness of generated narratives.
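As a flavor of the word-overlap family, here is a toy clipped unigram precision, the basic building block of BLEU. This is a simplification for illustration; real evaluations use full BLEU/METEOR implementations with n-gram clipping, brevity penalties, and proper tokenization.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: fraction of candidate tokens that
    also occur in the reference, counting each reference token once."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(1, sum(cand.values()))

p = unigram_precision("the eagle serves french food",
                      "the eagle is a french coffee shop")
print(p)  # 3 of 5 candidate tokens overlap -> 0.6
```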
The survey highlights promising avenues for further D2T research, including fairness, accountability, and the integration of numerical reasoning capabilities.
Statistics
The survey discusses several key statistics and figures related to D2T datasets, including:
The RoboCup dataset contains 1,539 pairs of temporally ordered simulated soccer games with their respective human commentaries.
The WeatherGov dataset has 29,528 MR-text pairs, each MR comprising 36 weather records.
The E2E dataset has 50,602 instances of MR-text pairs of restaurant descriptions.
The WebNLG dataset has 27,731 multi-domain graph-text pairs.
The DART dataset has 82,191 open-domain graph-text pairs.
The WikiBio dataset has 728,321 table-text pairs of Wikipedia info-boxes and their associated article paragraphs.
The RotoWire dataset has 4,853 verbose descriptions of NBA game statistics, with an average reference length of 337 words.
Quotes
"A picture is worth a thousand words - isn't it? And hence graphical representation is by its nature universally superior to text - isn't it? Why then isn't the anecdote itself represented graphically?"
"Bloomberg News generates a third of its content with Cyborg, their in-house automation system that can dissect tedious financial reports and churn out news articles within seconds."
"The neural boom that has sparked natural language processing (NLP) research throughout the last decade has similarly led to significant innovations in data-to-text generation (D2T)."