The paper proposes TI-ASU, a Text-to-speech Imputation approach for Automatic Speech Understanding (ASU), to address the challenge of a missing speech modality in ASU applications. The core idea is to impute the missing speech modality by using pre-trained text-to-speech (TTS) models to synthesize audio from the available text transcriptions.
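A minimal sketch of this imputation idea is shown below, assuming a generic pre-trained TTS interface; the `synthesize` call and the dataset fields are illustrative placeholders, not the paper's actual implementation.

```python
# Sketch of TTS-based speech imputation for an ASU training set.
# `tts_model.synthesize` stands in for any pre-trained text-to-speech
# model; the field names ("speech", "text") are assumed for illustration.

def impute_missing_speech(dataset, tts_model):
    """Fill in missing speech with TTS audio generated from the transcript."""
    imputed = []
    for example in dataset:
        if example["speech"] is None and example["text"] is not None:
            # Speech modality is missing: synthesize it from the transcript.
            example["speech"] = tts_model.synthesize(example["text"])
            example["is_synthetic"] = True
        else:
            example["is_synthetic"] = False
        imputed.append(example)
    return imputed
```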
The authors first investigate scenarios where speech data is missing from the training set and show that TI-ASU substantially improves ASU performance compared to training on the limited real speech that remains. Even in the extreme case where 95% of the speech data is missing, TI-ASU outperforms training on the remaining 5% of real speech alone.
The authors further explore cases where speech data can be missing in both the training and testing sets. They propose TI-ASU Dropout, which combines TI-ASU with dropout training to make the model robust to a missing speech modality at inference time. The results demonstrate that TI-ASU Dropout provides competitive or better performance than multimodal dropout training, especially when a significant portion of the speech data is missing.
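A rough sketch of the modality-dropout component follows, assuming a two-branch (speech + text) model with concatenation fusion; the dropout probability and fusion scheme are assumptions for illustration, not the paper's exact configuration.

```python
import torch

def fuse_with_modality_dropout(speech_emb, text_emb, p_drop=0.3, training=True):
    """Randomly drop the speech embedding during training so the model
    learns to rely on text alone when speech is missing at test time."""
    if training and torch.rand(1).item() < p_drop:
        # Simulate a missing speech modality by zeroing its embedding.
        speech_emb = torch.zeros_like(speech_emb)
    # Simple concatenation fusion; the actual fusion is model-specific.
    return torch.cat([speech_emb, text_emb], dim=-1)
```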
Additionally, the authors investigate leveraging large language models (LLMs) to augment the text transcriptions used for TTS-based speech imputation. While the LLM-assisted approach shows promise, the authors identify challenges in maintaining the quality of the generated speech samples.
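The sketch below illustrates one way such LLM-assisted augmentation could be wired in before TTS synthesis; `llm_generate` is a hypothetical completion call and the prompt wording is an assumption, not the paper's actual prompt or API.

```python
# Sketch of LLM-assisted transcript augmentation prior to TTS imputation.
# `llm_generate` is a placeholder for any large-language-model text
# generation call; it is not a specific API used in the paper.

def augment_transcript(transcript, llm_generate, n_variants=3):
    """Ask an LLM to paraphrase a transcript, giving the TTS model more
    varied text to synthesize for missing-speech examples."""
    prompt = (
        "Paraphrase the following utterance while preserving its meaning "
        f"and intent:\n{transcript}"
    )
    return [llm_generate(prompt) for _ in range(n_variants)]
```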
Overall, the paper presents a comprehensive study on enhancing ASU through TTS-based speech imputation, demonstrating the effectiveness of the proposed TI-ASU framework in addressing missing speech modality challenges.
Source: Tiantian Fen..., arxiv.org, 04-30-2024, https://arxiv.org/pdf/2404.17983.pdf