Core Concepts
A single multi-task learning model "UniverSLU" can perform various speech classification and sequence generation tasks, often outperforming or matching state-of-the-art task-specific models. UniverSLU leverages natural language instructions as prompts to enhance user-friendliness and generalization.
Abstract
The paper introduces a prompt-based multi-task learning (MTL) framework for spoken language understanding (SLU) tasks. The key contributions are:
A novel approach that leverages human-interpretable natural language instructions combined with a list of option labels as prompts. This enhances the model's user-friendliness and ability to generalize to unseen paraphrases.
Evaluation of the model's zero-shot capabilities, where it can generalize to new datasets and languages for seen task types, although it struggles with fully unseen task types.
Building a single MTL model, named UniverSLU, that can perform 12 SLU task types across 17 datasets and 9 languages. UniverSLU often outperforms or matches state-of-the-art task-specific models.
The paper starts by adapting a pre-trained automatic speech recognition model to additional SLU tasks using single-token task specifiers. It then enhances this approach with "instruction tuning": the model is fine-tuned on prompts that describe the task in natural language, followed by the list of label options.
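To make the two prompting styles concrete, here is a minimal sketch; the special-token format, function names, task wording, and label options are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch of the two prompting styles described above.
# The <|...|> token convention, task wording, and label sets are
# assumptions for illustration, not UniverSLU's verbatim format.

def task_specifier_prompt(task_token: str) -> str:
    """Style 1: an opaque single-token task specifier prepended to
    the decoder input (assumed Whisper-like special-token format)."""
    return f"<|{task_token}|>"

def instruction_prompt(task_description: str, options: list[str]) -> str:
    """Style 2: a human-interpretable natural language instruction
    followed by the list of label options (assumed wording)."""
    return f"{task_description} Options: {', '.join(options)}."

if __name__ == "__main__":
    print(task_specifier_prompt("intent-classification"))
    # -> <|intent-classification|>
    print(instruction_prompt(
        "Classify the intent of the spoken utterance.",
        ["set_alarm", "play_music", "get_weather"],
    ))
    # -> Classify the intent of the spoken utterance.
    #    Options: set_alarm, play_music, get_weather.
```

Because the instruction is plain language, users can paraphrase it at inference time, which is what enables the generalization to unseen task descriptions noted above.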
The authors demonstrate the effectiveness of UniverSLU on a diverse set of SLU tasks, including speech classification (intent classification, speech command recognition, emotion recognition, etc.) and sequence generation (named entity recognition, semantic parsing). UniverSLU achieves competitive or superior performance compared to task-specific baselines on most tasks.
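As a rough illustration of how a single decoder covers both task families: a classification task can emit one label string, while a sequence-generation task such as named entity recognition can emit the transcript with inline entity tags. The tag syntax and parser below are assumed formats for illustration, not the paper's exact output scheme.

```python
import re

# Assumed output formats for the two task families (illustrative only).

# Speech classification: the decoder emits a single label string.
classification_output = "play_music"

# Sequence generation (NER): the decoder emits the transcript with
# inline entity tags; entities are recovered by parsing the string.
ner_output = "wake me up at [TIME nine am] on [DATE friday]"

def parse_entities(tagged: str) -> list[tuple[str, str]]:
    """Extract (entity_type, surface_span) pairs from a tagged transcript."""
    return re.findall(r"\[(\w+) ([^\]]+)\]", tagged)

print(parse_entities(ner_output))  # [('TIME', 'nine am'), ('DATE', 'friday')]
```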
The paper also investigates the model's zero-shot capabilities, finding that it can generalize to new datasets and languages for seen task types but struggles with fully unseen task types. The authors discuss these limitations and outline future work to address them.
Stats
"Our experiments encompass 10 speech classification and 2 sequence generation task types, covering 17 publicly available datasets based on 9 languages."
"We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages."
Quotes
"Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models."
"We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages."
"Our approach can generalize to new task descriptions for the seen tasks during inference, thereby enhancing its user-friendliness."