
Advancements in Neural Data-to-Text Generation: A Comprehensive Survey


Key Concepts
This survey offers a consolidated view of the neural data-to-text (D2T) paradigm, examining the latest approaches, benchmark datasets, and evaluation protocols. It highlights promising avenues for D2T research that focus on linguistic capabilities as well as fairness and accountability.
Summary
This survey provides a structured examination of innovations in neural data-to-text (D2T) generation over the past half-decade. It covers the following key aspects:

Datasets for D2T Generation:
Meaning Representations (MRs): Datasets like RoboCup, WeatherGov, BAGEL, and E2E that align data instances with their corresponding narratives.
Graph Representations: Datasets like WebNLG, AGENDA, and DART that use graph-based data instances.
Tabular Representations: Datasets like WikiBio, RotoWire, TabFact, and ToTTo that use tabular data instances.
Data Collection and Enrichment: Approaches to efficiently acquire and enrich D2T datasets.

Data-to-Text Generation Fundamentals:
Content Selection: Determining what information from the data instance to include in the narrative.
Surface Realization: Constructing the narrative structure using words, phrases, and paragraphs.
Hallucinations and Omissions: Balancing coherence, linguistic diversity, and data fidelity.

Innovations in the Seq2Seq Framework:
Supervised Learning: Architectural and loss-function interventions to improve data fidelity, including entity encoders, hierarchical encoders, plan encoders, graph encoders, reconstruction and hierarchical decoders, regularization techniques, and reinforcement learning.
Unsupervised Learning: Approaches like variational autoencoders, adversarial training, and iterative text editing to induce templates and improve generation.

Data Preprocessing:
Delexicalization and Noise Reduction: Techniques to handle data sparsity and semantic noise.
Linearization: Effective representations of graph-structured data for seq2seq models.
Data Augmentation: Approaches to enrich the training data, including permutation, adversarial examples, and few-shot learning.

Evaluation Protocols:
Automated Metrics: Word-overlap, extractive, and semantic metrics to assess D2T systems.
Human Evaluation: Protocols to measure coherence, fluency, and faithfulness of generated narratives.
The survey highlights promising avenues for further D2T research, including fairness, accountability, and the integration of numerical reasoning capabilities.
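To make the linearization and delexicalization steps listed in the summary more concrete, here is a minimal Python sketch. It is an illustration only, not the survey's method: the `<S>`/`<P>`/`<O>` marker tokens, the `__SLOT__` placeholder format, the function names, and the sample triples and sentence are all assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def linearize_triples(triples: List[Triple]) -> str:
    """Flatten graph triples into one token sequence for a seq2seq encoder.

    The <S>/<P>/<O> markers are an assumed convention, not a standard.
    """
    parts = [f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples]
    return " ".join(parts)


def delexicalize(text: str, slots: Dict[str, str]) -> str:
    """Replace slot values in the text with placeholder tokens."""
    # Longest values first, so a short value cannot clobber part of a longer one.
    for slot, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(value, f"__{slot.upper()}__")
    return text


def relexicalize(text: str, slots: Dict[str, str]) -> str:
    """Put the original slot values back into a delexicalized text."""
    for slot, value in slots.items():
        text = text.replace(f"__{slot.upper()}__", value)
    return text


# Hypothetical inputs in the style of graph and MR datasets.
triples = [
    ("Alan Bean", "occupation", "test pilot"),
    ("Alan Bean", "birthPlace", "Wheeler, Texas"),
]
slots = {"name": "The Mill", "price": "cheap"}

print(linearize_triples(triples))
print(delexicalize("The Mill is a cheap pub.", slots))
```

Delexicalizing before training reduces sparsity because the model learns one pattern per slot rather than one per value; the placeholders are filled back in with `relexicalize` after generation.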
Statistics
The survey discusses several key statistics and figures related to D2T datasets, including:
The RoboCup dataset contains 1,539 pairs of temporally ordered simulated soccer games with their respective human commentaries.
The WeatherGov dataset has 29,528 MR-text pairs, each consisting of 36 different weather states.
The E2E dataset has 50,602 instances of MR-text pairs of restaurant descriptions.
The WebNLG dataset has 27,731 multi-domain graph-text pairs.
The DART dataset has 82,191 open-domain graph-text pairs.
The WikiBio dataset has 728,321 table-text pairs of Wikipedia info-boxes and their associated article paragraphs.
The RotoWire dataset has 4,853 verbose descriptions of NBA game statistics, with an average reference length of 337 words.
Quotes
"A picture is worth a thousand words - isn't it? And hence graphical representation is by its nature universally superior to text - isn't it? Why then isn't the anecdote itself represented graphically?"
"Bloomberg News generates a third of its content with Cyborg, their in-house automation system that can dissect tedious financial reports and churn out news articles within seconds."
"The neural boom that has sparked natural language processing (NLP) research throughout the last decade has similarly led to significant innovations in data-to-text generation (D2T)."

Key insights drawn from

by Mandar Sharm... at arxiv.org 04-03-2024

https://arxiv.org/pdf/2207.12571.pdf
Innovations in Neural Data-to-text Generation

Deeper Questions

How can D2T systems be designed to better handle numerical reasoning and mathematical operations over tabular data?

Several strategies can improve how D2T systems handle numerical reasoning and mathematical operations over tabular data:
Specialized Encoders: Incorporate encoders that capture numerical information from the table, encoding values, operations, and the relationships between them in a structured manner.
Mathematical Operation Encoders: Introduce encoders that understand and process the mathematical operations present in the tabular data, so that narratives involving numerical calculations are accurate and contextually relevant.
Attention Mechanisms: Use attention that focuses on numerical values and their relationships during generation, so that numerical information is prioritized and properly integrated into the narrative.
Data Augmentation: Augment the training data with a variety of numerical scenarios and mathematical operations, exposing the system to a diverse range of numerical reasoning tasks.
Domain-Specific Knowledge: Incorporate domain-specific knowledge about mathematical operations and numerical reasoning, so that narratives over numerical tables are more accurate and contextually relevant.
Together, these strategies can make D2T systems more effective at numerical reasoning and mathematical operations over tabular data.
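One simple way to expose numerical relationships to a generator, sketched below, is to precompute a few operations over a table column and append the results as extra input records, so the model can copy a result instead of computing it. This is an assumed, simplified stand-in for the encoder and augmentation ideas above, not a method described in the survey; the function name, the record format, and the sample table are hypothetical.

```python
from typing import Dict, List

# A table is modeled as row name -> {column name -> numeric value}.
Table = Dict[str, Dict[str, float]]


def derive_numeric_facts(table: Table, column: str) -> List[str]:
    """Precompute max, min, difference, and sum over one numeric column.

    Each fact is returned as a flat string record (an assumed format)
    that can be appended to the linearized table before encoding.
    """
    values = {row: cells[column] for row, cells in table.items()}
    top = max(values, key=values.get)   # row with the largest value
    low = min(values, key=values.get)   # row with the smallest value
    return [
        f"max {column} {top} {values[top]:g}",
        f"min {column} {low} {values[low]:g}",
        f"diff {column} {top} {low} {values[top] - values[low]:g}",
        f"sum {column} {sum(values.values()):g}",
    ]


# Hypothetical box-score fragment.
table = {"Hawks": {"points": 102}, "Bulls": {"points": 95}}
for fact in derive_numeric_facts(table, "points"):
    print(fact)
```

The same derived records could also serve as augmentation: pairing them with reference sentences that mention a margin or a total gives the model explicit supervision for those operations.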