Core Concepts
Neural models for semantic parsing and text generation with Discourse Representation Structures (DRS) exhibit strong performance on standard test sets, but struggle with more challenging benchmarks that require handling longer texts and compositional generalization.
Abstract
The authors argue that the current performance of neural semantic parsers and text generators on the Parallel Meaning Bank (PMB) dataset is inflated due to data leakage and non-representative test sets. To address this, they introduce several changes:
A systematic (rather than random) data split that produces a more reliable standard test set by reducing word overlap between training and test data.
A long-text challenge set with manually annotated longer documents to assess the models' performance on complex, multi-sentence texts.
A compositional challenge set created by recombining Combinatory Categorial Grammar (CCG) derivation trees, to evaluate the models' ability to handle compositional generalization.
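The summary does not describe the systematic split algorithm itself; a minimal sketch of one plausible approach (routing the sentences with the rarest vocabulary into the test set, so that few of their words recur in training) might look like the following. The function name and scoring heuristic are illustrative assumptions, not the paper's actual method.

```python
from collections import Counter

def systematic_split(sentences, test_fraction=0.1):
    """Illustrative sketch (not the paper's algorithm): put sentences whose
    words are rarest in the corpus into the test set, so fewer of their
    words also occur in training."""
    counts = Counter(w for s in sentences for w in s.split())

    def rarity(sent):
        # Average corpus frequency of the sentence's words;
        # a low score means rare vocabulary.
        words = sent.split()
        return sum(counts[w] for w in words) / len(words)

    ranked = sorted(sentences, key=rarity)
    n_test = max(1, int(test_fraction * len(sentences)))
    return ranked[n_test:], ranked[:n_test]  # train, test
```

A greedy ranking like this is only one way to operationalize "systematic"; the key idea is that membership in the test set is decided by vocabulary statistics rather than by chance.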
Experiments with five neural models (LSTM, mT5, byT5, mBART, DRS-MLM) show that their performance declines significantly on the challenge sets, revealing the limitations of these models when confronted with longer texts and compositional structures. The authors conclude that semantic parsing and meaning-to-text generation are far from being solved tasks.
Stats
The average sentence length in the standard test set is around 5-6 words, while the long-text challenge set has an average of 61 words.
The systematic data split reduces the word overlap rate between training and test sets compared to the random split.
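The exact overlap metric is not given in this summary; a minimal whitespace-tokenized sketch of one plausible definition (the fraction of test-set word tokens that also occur in the training vocabulary) is:

```python
def word_overlap_rate(train_sents, test_sents):
    """One possible definition of train/test word overlap: the fraction of
    test-set word tokens whose word type also appears in the training set.
    Tokenization here is plain whitespace splitting, an assumption."""
    train_vocab = {w for s in train_sents for w in s.split()}
    test_words = [w for s in test_sents for w in s.split()]
    return sum(w in train_vocab for w in test_words) / len(test_words)
```

Under this definition, a random split of short, repetitive sentences yields a high overlap rate, which is exactly what the systematic split is designed to lower.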
Quotes
"The rapid development of neural models and their incredible performance seem to make the impression that tasks like semantic parsing are practically solved."
"We carried out a critical examination of the PMB and revealed three (related) problems: (1) there is a "data leakage" from the training data to the development and test splits; (2) the random splits of the data lead to a non-optimal division; and (3) the test set is often regarded as "easy" as it contains a large amount of relatively short sentences."