Core Concepts
Instruction tuning in speech models is crucial for zero-shot learning, as demonstrated by the Dynamic-SUPERB benchmark.
Summary
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. We present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. To initiate, Dynamic-SUPERB features 55 evaluation instances formed by combining 33 tasks and 22 datasets. These span a broad spectrum of dimensions, providing a comprehensive platform for evaluation. Additionally, we propose several approaches to establish benchmark baselines, including the use of speech models, text language models, and a multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.
Each task consists of at least three components: a speech utterance, an instruction, and a label. Utterances were sourced from widely-used corpora such as LibriSpeech, LJSpeech, and VCTK. Depending on the task specification, there may be more than one utterance; in speaker verification, for example, two utterances are involved to determine whether they were produced by the same speaker. For instructions, we chose text format rather than spoken format: spoken instructions, with their varying content, speaker characteristics, and prosody, are more complex than text-based ones, and text instructions serve as an intermediary bridging text and spoken instruction tuning. To generate varied instructions, we first created some basic ones manually and then used ChatGPT to rephrase them and produce additional variations, resulting in each task containing approximately 10 to 30 different instructions.
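To make the instance structure concrete, here is a minimal sketch of how an evaluation instance with these three components might be represented. The class and field names are illustrative assumptions, not the benchmark's actual schema.

```python
# Minimal sketch of a Dynamic-SUPERB evaluation instance.
# Field names are assumptions for illustration, not the benchmark's schema.
from dataclasses import dataclass
from typing import List

@dataclass
class TaskInstance:
    audio_paths: List[str]   # one or more utterances, depending on the task
    instruction: str         # text-format instruction describing the task
    label: str               # ground-truth answer

# Speaker verification involves two utterances and a yes/no label.
example = TaskInstance(
    audio_paths=["spk1_utt1.wav", "spk2_utt1.wav"],
    instruction=("Determine whether the two recordings were produced "
                 "by the same speaker. Answer yes or no."),
    label="no",
)
```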
We propose five approaches to establish baselines in Dynamic-SUPERB. We integrated text embeddings from BERT into the generative spoken language model (GSLM), enabling operations on both speech and text. We also adapted Whisper, which was primarily designed for speech recognition and involves both speech and text, to instruction tuning. Besides modifying speech models, we integrated speech representations from either Whisper or the multimodal encoder ImageBind into LLaMA, a prevalent large language model (LLM). Finally, we combined Whisper ASR with ChatGPT to build a concatenative system (ASR-ChatGPT).
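Of these baselines, the concatenative pipeline is the simplest to reproduce: transcribe the utterance with Whisper, then pass the instruction and transcript to a chat model. The sketch below shows this idea; the model choices and prompt template are assumptions, not the paper's exact setup.

```python
# Minimal sketch of a concatenative ASR -> ChatGPT pipeline.
# Checkpoint ("base"), chat model, and prompt wording are assumptions.
import whisper                 # openai-whisper package
from openai import OpenAI      # official OpenAI SDK

asr = whisper.load_model("base")
client = OpenAI()              # reads OPENAI_API_KEY from the environment

def asr_chatgpt(audio_path: str, instruction: str) -> str:
    # Step 1: speech -> text with Whisper.
    transcript = asr.transcribe(audio_path)["text"]
    # Step 2: instruction following on the transcript with a chat model.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"{instruction}\nTranscript: {transcript}"}],
    )
    return response.choices[0].message.content

# e.g., asr_chatgpt("utt.wav", "What is the sentiment of this speech?")
```

Note a structural limitation of this design: only the transcript reaches the language model, so purely acoustic attributes such as speaker identity and prosody are invisible to the pipeline.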
While instruction tuning is increasingly popular for enabling zero-shot applications in NLP, it remains underexplored in speech processing. This paper presents Dynamic-SUPERB, the first dynamic, collaborative benchmark for instruction tuning in speech models, offering a comprehensive exploration across diverse dimensions.
Quotes
"We believe that instruction tuning is a pivotal step towards universal speech models."
"Dynamic-SUPERB seeks to dynamically expand its task variety through community collaboration."