The proposed semi-supervised learning (SSL) framework introduces a multi-view pseudo-labeling method that uses both acoustic and linguistic characteristics to select the highest-confidence unlabeled data for training a bimodal classifier.
The acoustic path employs Fréchet Audio Distance (FAD) to measure the similarity between labeled and unlabeled data based on embeddings from multiple audio encoders. The linguistic path uses large language models (LLMs) with task-specific prompts to predict labels from automatic speech recognition (ASR) transcriptions, leveraging insights from acoustics, linguistics, and psychology.
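As a concrete illustration of the acoustic path, the sketch below computes FAD as the Fréchet distance between Gaussians fitted to two embedding sets and assigns each unlabeled utterance the label of the closest class. The per-utterance scoring rule and function names are assumptions for illustration, not the paper's exact selection procedure.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets.

    emb_a, emb_b: (num_samples, dim) arrays of audio-encoder embeddings.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Matrix square root of the covariance product; discard the tiny
    # imaginary component that numerical error can introduce.
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def acoustic_pseudo_label(utt_frames: np.ndarray, class_embeddings: dict) -> str:
    """Assign the class whose labeled embeddings are closest in FAD.

    utt_frames: (frames, dim) frame-level embeddings of one unlabeled utterance.
    class_embeddings: dict mapping label -> (frames, dim) labeled embeddings.
    """
    scores = {label: frechet_audio_distance(utt_frames, ref)
              for label, ref in class_embeddings.items()}
    return min(scores, key=scores.get)
```

The linguistic path can be approximated by prompting an LLM over ASR transcriptions. The prompt wording, the emotion label set, and the `llm_complete` wrapper below are hypothetical placeholders for whichever model and task-specific prompt is used.

```python
PROMPT = (
    "You are an expert in acoustics, linguistics, and psychology. "
    "Classify the speaker's emotion in the following ASR transcription "
    "as one of: angry, happy, neutral, sad. Answer with the label only.\n"
    "Transcription: {text}"
)

def linguistic_pseudo_label(llm_complete, text: str) -> str:
    """llm_complete: hypothetical callable wrapping any chat/completion API."""
    return llm_complete(PROMPT.format(text=text)).strip().lower()
```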
Data with matching pseudo-labels from both paths are considered high-confidence and used to train the initial bimodal classifier. The classifier is then iteratively updated by incorporating low-confidence data whose pseudo-labels align with either the acoustic or linguistic path. Multiple fusion techniques are compared to effectively utilize the multi-view knowledge.
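A minimal sketch of this selection loop, assuming a scikit-learn-style classifier over fused bimodal features: samples on which the two views agree seed the training set, and disagreeing samples are absorbed iteratively when the classifier's own prediction matches either view's pseudo-label. The confidence criterion, round count, and stopping rule here are assumptions; in practice the original labeled data would also be included in training.

```python
import numpy as np

def multiview_pseudo_label_training(clf, X_unlab, y_acoustic, y_linguistic,
                                    rounds: int = 3):
    """Agreement-based selection loop (illustrative, details assumed).

    clf: any classifier with scikit-learn style fit/predict.
    X_unlab: (n, d) fused features for unlabeled samples.
    y_acoustic, y_linguistic: (n,) pseudo-labels from the two paths.
    """
    y_ac = np.asarray(y_acoustic)
    y_li = np.asarray(y_linguistic)
    high = np.flatnonzero(y_ac == y_li)   # views agree: high-confidence
    low = np.flatnonzero(y_ac != y_li)    # views disagree: low-confidence
    X_train, y_train = X_unlab[high], y_ac[high]
    clf.fit(X_train, y_train)             # initial bimodal classifier

    for _ in range(rounds):
        if low.size == 0:
            break
        preds = clf.predict(X_unlab[low])
        # Keep samples whose prediction matches either view's pseudo-label.
        keep = (preds == y_ac[low]) | (preds == y_li[low])
        if not keep.any():
            break
        added = low[keep]
        X_train = np.vstack([X_train, X_unlab[added]])
        y_train = np.concatenate([y_train, preds[keep]])
        low = low[~keep]
        clf.fit(X_train, y_train)         # retrain on the expanded set
    return clf
```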
The proposed SSL framework is evaluated on emotion recognition and dementia detection tasks, demonstrating competitive performance compared to fully supervised learning while using only 30% of the labeled data. It also significantly outperforms selected baselines, including decision merging and co-training.
by Yuanchao Li, ..., arxiv.org, 09-26-2024
https://arxiv.org/pdf/2409.16937.pdf