Core Concepts
Whisper models demonstrate effective in-context learning abilities for speech recognition: supplying labelled speech examples at test time improves performance without any gradient descent.
Abstract
Investigates the in-context learning abilities of Whisper ASR models.
Proposes speech-based in-context learning (SICL), a novel test-time adaptation approach that requires no gradient descent.
Achieves substantial relative WER reductions with SICL on Chinese dialects (up to 36.4% on average when combined with example selection).
Uses k-nearest-neighbours retrieval for efficient in-context example selection.
Validates the findings on speaker adaptation and continuous speech recognition tasks.
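The k-nearest-neighbours example selection mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact method: the embedding source (e.g. a speech encoder) and the cosine-similarity criterion are assumptions.

```python
import numpy as np

def select_in_context_examples(query_emb, example_embs, k=3):
    """Return indices of the k candidate utterances whose embeddings
    are most similar (by cosine similarity) to the test utterance.

    query_emb:    (d,) embedding of the test utterance (assumed given)
    example_embs: (n, d) embeddings of the candidate example pool
    """
    # Normalise so a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q
    # Highest-similarity candidates first.
    return np.argsort(-sims)[:k]

# Toy usage: the query is closest to candidates 0 and 2.
idx = select_in_context_examples(
    np.array([1.0, 0.0]),
    np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]),
    k=2,
)
```

The selected utterances (audio plus transcripts) would then be provided to Whisper as in-context examples alongside the test utterance.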
Stats
"A k-nearest-neighbours-based in-context example selection technique can be applied to further improve the efficiency of SICL, which can increase the average relative WER reduction to 36.4%."
"Compared to ICL for text-based LLMs, SICL consistently reduces the word error rate (WER) irrespective of the Whisper model size or the specific dialect to adapt."
Quotes
"No gradient descent is required for test-time language-level adaptation using SICL."