Instruction tuning is key to zero-shot learning in speech models. Text language models already excel at zero-shot generalization when given well-formulated instructions, and Dynamic-SUPERB extends this idea to speech: it is a benchmark aimed at building universal speech models that use instruction tuning to handle diverse tasks.
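To make the instruction-tuning framing concrete, the sketch below shows how two different speech tasks can be cast into a single shared instruction format, so one model can be tuned on and evaluated across both. The field names, file paths, and task wordings here are illustrative assumptions, not taken from Dynamic-SUPERB itself.

```python
from dataclasses import dataclass

@dataclass
class InstructionInstance:
    """One instruction-formatted example: a speech input paired with a
    natural-language instruction, with the expected answer given as text."""
    instruction: str   # natural-language description of the task
    audio_path: str    # path to the input utterance (illustrative)
    label: str         # target answer, also expressed in text

# Two different speech tasks expressed in the same instruction format.
examples = [
    InstructionInstance(
        instruction="Identify the emotion conveyed by the speaker.",
        audio_path="clips/utt_001.wav",
        label="happy",
    ),
    InstructionInstance(
        instruction="Decide whether the two utterances come from the same speaker.",
        audio_path="clips/pair_042.wav",
        label="yes",
    ),
]

for ex in examples:
    print(f"[{ex.instruction}] {ex.audio_path} -> {ex.label}")
```

Because every task shares this instruction-plus-audio-to-text shape, a model tuned on a mixture of such instances can, in principle, be prompted with an unseen instruction at test time, which is the zero-shot setting the benchmark targets.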