Core Concepts
The choice of the Speech Foundation Model (SFM) is the most critical factor influencing downstream performance in Automatic Speech Recognition (ASR) and Speech Translation (ST), while the choice of the Large Language Model (LLM) and length adapter design have a less pronounced impact.
Abstract
The paper explores the integration of Speech Foundation Models (SFMs) and Large Language Models (LLMs) for speech-to-text tasks, focusing on the relative importance of the different components in the overall architecture.
The key findings are:
The choice of the SFM is the most critical factor: on average, the best configurations using SeamlessM4T outperform the best ones using Whisper by more than 2 COMET points on ST and by more than 1 WER point on ASR.
The choice of the LLM (Mistral or Llama) has a less pronounced impact: the gap between the best configurations is less than 0.2 points on both ASR and ST.
There is no one-size-fits-all solution for the length adapter, as the optimal choice depends on the specific combination of SFM and LLM. Content-based length adapters (CIF-based and CTC-based) consistently underperform other strategies.
The Base adapter, which does not compress the speech sequence, and the WLQ-former, which has high compression factors, achieve competitive scores in most settings, suggesting that reducing the sequence length mismatch between speech and text is less crucial than previously assumed.
Overall, the results highlight the need to experiment with different SFM and LLM combinations when evaluating adapter solutions, as improvements in one specific scenario may not generalize.
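To make the idea of a content-based length adapter more concrete, the following is a minimal sketch of CTC-style compression: consecutive frames that share the same predicted label are merged into one vector by averaging. It is a simplified illustration, not the paper's implementation. The function name `ctc_compress`, the use of plain Python lists for frame vectors, and the choice to keep blank runs as their own merged vector are all assumptions made for this example.

```python
from itertools import groupby


def ctc_compress(frames, labels):
    """Merge runs of consecutive frames that share a predicted label.

    frames: list of equal-length float vectors, one per encoder frame.
    labels: list of per-frame predicted labels (e.g. CTC argmax output).
    Returns one averaged vector per run of identical labels.
    Assumption: blank runs are kept and averaged like any other label.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    compressed = []
    start = 0
    for _, run in groupby(labels):
        run_length = len(list(run))
        segment = frames[start:start + run_length]
        dim = len(segment[0])
        # Average the vectors in this run, dimension by dimension.
        compressed.append(
            [sum(vec[d] for vec in segment) / run_length for d in range(dim)]
        )
        start += run_length
    return compressed


# Hypothetical toy input: four 2-dimensional frames, three label runs.
toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
toy_labels = ["a", "a", "_", "b"]
toy_output = ctc_compress(toy_frames, toy_labels)
```

Because the output length depends on how many label runs the predictor produces, the compression factor of such an adapter varies with the content of the utterance and with the frame rate of the SFM feeding it.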
Stats
The average compression ratio for the CIF-based adapter is 25:1 with Whisper and 3:1 with SeamlessM4T.
The average compression ratio for the CTC-based adapter is 13:1 with Whisper and 2:1 with SeamlessM4T.
The compression ratio for the WLQ-former adapter is 16:1 with Whisper and 2:1 with SeamlessM4T.
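A ratio such as 25:1 means that, on average, 25 SFM output frames are mapped to a single vector passed to the LLM. The sketch below shows one way such a figure could be computed. The helper names, the sample lengths, and the decision to average per-utterance ratios (rather than dividing total frames by total outputs) are assumptions for illustration; the summary does not state how the paper aggregates them.

```python
def compression_ratio(input_frames: int, output_vectors: int) -> float:
    """Number of encoder frames represented by one adapter output vector."""
    if input_frames <= 0 or output_vectors <= 0:
        raise ValueError("lengths must be positive")
    return input_frames / output_vectors


def average_compression_ratio(pairs):
    """Mean of per-utterance ratios over (input_frames, output_vectors) pairs."""
    ratios = [compression_ratio(i, o) for i, o in pairs]
    return sum(ratios) / len(ratios)


# Hypothetical lengths for three utterances, not taken from the paper.
sample_lengths = [(1500, 60), (1500, 100), (1500, 50)]
sample_average = average_compression_ratio(sample_lengths)
```

Here the per-utterance ratios are 25, 15, and 30, so the average is roughly 23.3 frames per output vector.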
Quotes
"The choice of the SFM is the most critical factor influencing downstream performance, while the choice of the LLM and length adapter has a less pronounced impact on the final performance."
"There is no one-size-fits-all solution for the length adapter as its choice highly depends on the selected SFM and LLM combination."