Core Concepts
Large Language Models (LLMs) can be used to generate high-quality dialogue data efficiently.
Abstract
LUCID is a data generation system that leverages Large Language Models (LLMs) to create diverse and challenging dialogues. It addresses the scarcity of high-quality dialogue data for virtual assistants by automating data generation. The system generates multi-domain, multi-intent conversations that contain challenging conversational phenomena. LUCID's validation process checks that system labels are accurate, which makes evaluation results more reliable. The released dataset covers 100 intents across 13 domains, showing that the system can create complex data efficiently.
LUCID decomposes data generation into four stages: intent generation, conversation planning, turn-by-turn conversation generation, and LLM-based validation. The system enforces diversity in slot values and intents, and it surpasses existing task-oriented dialogue datasets in complexity and in the number of labeled conversational phenomena. LUCID also provides a detailed analysis and baseline results for evaluating the generated data.
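The staged pipeline described above can be sketched roughly as follows. This is a minimal illustration, not LUCID's actual code: the function names, prompt wording, data classes, and the "valid"/"invalid" verdict format are all assumptions made for the example, and the LLM is represented as any callable that maps a prompt string to a completion string.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical type: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Turn:
    speaker: str      # "user" or "system"
    text: str         # utterance, or the system label for system turns


@dataclass
class Conversation:
    intents: List[str]
    plan: List[str]
    turns: List[Turn] = field(default_factory=list)


def _lines(raw: str) -> List[str]:
    """Split an LLM completion into non-empty, stripped lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def generate_intents(llm: LLM, domain: str, n: int) -> List[str]:
    # Stage 1 (assumed prompt): ask for n intents in a domain.
    return _lines(llm(f"List {n} intents for the domain '{domain}', one per line."))[:n]


def plan_conversation(llm: LLM, intents: List[str]) -> List[str]:
    # Stage 2 (assumed prompt): produce an ordered list of plan steps.
    return _lines(llm("Plan a conversation covering: " + ", ".join(intents)))


def generate_turns(llm: LLM, plan: List[str]) -> List[Turn]:
    # Stage 3: generate one user turn and one labeled system turn per plan step.
    turns: List[Turn] = []
    for step in plan:
        user_text = llm(f"Write the user utterance for: {step}")
        label = llm(f"Write the system label for: {user_text}")
        turns.append(Turn("user", user_text))
        turns.append(Turn("system", label))
    return turns


def validate(llm: LLM, conv: Conversation) -> bool:
    # Stage 4 (assumed verdict format): keep only conversations judged valid.
    transcript = "\n".join(f"{t.speaker}: {t.text}" for t in conv.turns)
    verdict = llm(f"Check the labels. Answer 'valid' or 'invalid'.\n{transcript}")
    return verdict.strip().lower().startswith("valid")


def run_pipeline(llm: LLM, domain: str, n_intents: int) -> Optional[Conversation]:
    """Run all four stages; return None if validation rejects the result."""
    intents = generate_intents(llm, domain, n_intents)
    plan = plan_conversation(llm, intents)
    conv = Conversation(intents, plan, generate_turns(llm, plan))
    return conv if validate(llm, conv) else None
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model API, and it lets each stage be exercised with a deterministic stub before a real model is plugged in.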
The system's extensibility allows for large-scale data generation with more intents, domains, and complex conversational phenomena. By releasing both the code and dataset, LUCID aims to facilitate further research in generating high-quality dialogue data using LLMs.
Stats
Existing datasets include multi-turn, multi-intent conversations.
The LUCID dataset contains 100 intents across 13 domains.
The seed dataset consists of 4,277 dialogues with 92,699 turns (about 21.7 turns per dialogue on average).
Quotes
"Virtual assistants are poised to take a dramatic leap forward in terms of their dialogue capabilities."
"We aim to overcome issues with high quality dialogue data using LUCID."
"LUCID generates diverse and challenging dialogues efficiently."