Core Concepts
CoDa, a controllable and effective data augmentation technique for low-resource NLP, generates synthetic training instances by prompting off-the-shelf instruction-following Large Language Models to produce text that satisfies a set of simple constraints extracted from the original low-resource dataset.
Abstract
The paper presents CoDa, a novel data augmentation methodology for low-resource Natural Language Processing (NLP) tasks. CoDa works with any off-the-shelf instruction-tuned Large Language Model (LLM) in a training-free fashion and provides explicit control over the generated augmentations.
The key steps are:
1. For each training instance in the low-resource dataset, CoDa extracts a set of simple heuristic-based constraints, including lexical, syntactic, semantic, length, and concept constraints.
2. These constraints are then verbalized into a natural language instruction and used to prompt an LLM to generate augmented training instances.
3. The generated augmentations are then added to the original low-resource dataset to train downstream NLP models.
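The extract-then-verbalize flow described in these steps can be illustrated with a minimal sketch. It is not the authors' implementation: the `Constraints` class, the `extract_constraints` and `verbalize` helpers, the tiny stopword list, the "longest words" keyword heuristic, and the prompt wording are all assumptions made for illustration. It covers only lexical, length, and label constraints, and omits the syntactic and concept constraints as well as the LLM call itself.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "and", "was", "were", "this", "that", "with", "from", "have"}


@dataclass
class Constraints:
    keywords: list[str]  # lexical constraint: words the output should contain
    min_words: int       # length constraint: lower bound
    max_words: int       # length constraint: upper bound
    label: str           # task label the generated text should keep


def extract_constraints(text: str, label: str, num_keywords: int = 3) -> Constraints:
    """Derive simple heuristic constraints from one training instance."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    content = [t for t in tokens if t not in STOPWORDS and len(t) > 3]
    # Order-preserving de-duplication, then keep the longest words as a
    # crude stand-in for keyword extraction.
    distinct = list(dict.fromkeys(content))
    keywords = sorted(distinct, key=len, reverse=True)[:num_keywords]
    n = len(text.split())
    slack = max(1, n // 5)  # allow roughly +/- 20% around the original length
    return Constraints(keywords, max(1, n - slack), n + slack, label)


def verbalize(c: Constraints) -> str:
    """Turn the constraints into a natural language instruction for an LLM."""
    return (
        f"Write a short text with the label '{c.label}'. "
        f"It should contain the following keywords: {', '.join(c.keywords)}. "
        f"It should be between {c.min_words} and {c.max_words} words long."
    )


example = "The movie was surprisingly wonderful and the acting felt genuine"
constraints = extract_constraints(example, label="positive")
prompt = verbalize(constraints)
print(prompt)
```

In a full pipeline, the resulting prompt would be sent to an instruction-tuned LLM, and the returned text would be added to the training set under the same label.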
CoDa is shown to outperform various prior data augmentation techniques, including text-editing, learning-based infilling, LLM-based prompting, and rephrasing methods, across 11 datasets spanning 3 tasks (Sequence Classification, Named Entity Recognition, and Question Answering) and 3 low-resource settings (100, 200, and 500 training examples). The improvements range from 0.12% to 7.19% in F1 score.
The authors describe CoDa as the first framework to explore controlled generation for data augmentation, with the aim of keeping the synthetic data closely aligned with the needs of the task and the characteristics of the target domain. It offers a simpler, more intuitive natural language interface for constrained generation than complex decoding-time techniques or manual attribute extraction.