Core Concepts
The paper introduces JEP-KD, a joint-embedding predictive architecture for knowledge distillation that places a generative network within the embedding layer to strengthen the video encoder's semantic feature extraction and align the extracted features with the audio features of a pre-trained ASR model. The approach aims to progressively narrow the performance gap between visual speech recognition (VSR) and automatic speech recognition (ASR).
Summary
The paper proposes a novel knowledge distillation framework for visual speech recognition (VSR) called JEP-KD, which stands for Joint-Embedding Predictive Architecture-based Knowledge Distillation. The key innovation is the inclusion of a generative network within the embedding layer, which serves to enhance the video encoder's ability to extract semantic features and better align them with the audio features from a pre-trained ASR model.
The authors argue that the prevalent knowledge distillation methods for VSR, which rely on rigid alignment of video and audio features, are suboptimal due to the inherent semantic limitations of the video modality. They hypothesize that the semantic gaps between video and audio are systematic and predictable, and can be addressed by the proposed predictive framework.
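To make the predictive alignment concrete, here is a minimal PyTorch-style sketch of the idea: a generator sits on top of the video encoder's embeddings and is trained to predict the corresponding audio embeddings produced by a frozen ASR encoder, under a simple distance loss. The module name, Transformer-based predictor, feature dimensions, and the choice of L1 distance are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class EmbeddingGenerator(nn.Module):
    """Hypothetical generator: predicts audio-like semantic features
    from video embeddings (architecture and sizes are illustrative)."""
    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, video_emb):            # (batch, time, dim)
        return self.predictor(video_emb)     # predicted audio embeddings


def distance_loss(pred_audio_emb, audio_emb):
    """Distance between predicted and real ASR features; L1 is an assumption."""
    return nn.functional.l1_loss(pred_audio_emb, audio_emb)


# Usage sketch with dummy tensors; in practice audio_emb comes from the
# frozen, pre-trained ASR encoder and video_emb from the video encoder.
video_emb = torch.randn(4, 100, 512)
audio_emb = torch.randn(4, 100, 512)
generator = EmbeddingGenerator()
loss = distance_loss(generator(video_emb), audio_emb)
loss.backward()
```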
The JEP-KD architecture consists of four main components: the video encoder, the generator, the discriminator, and the decoder. The training process is divided into three stages (a training-loop sketch follows the list):
- Warm-up stage: Train the encoder, generator, and decoder with CTC and CE losses.
- Enhancement stage: Freeze the encoder and decoder; train the generator and discriminator using an adversarial loss and a distance loss between video and audio semantic features.
- Refinement stage: Freeze the encoder, generator, and discriminator, and fine-tune the decoder.
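The staged schedule can be summarized as a PyTorch-style sketch, assuming the components named above (encoder, generator, discriminator, decoder, and a frozen ASR encoder). The dual-headed decoder, the BCE-with-logits adversarial loss, the abstract `ctc_loss`/`ce_loss`/`distance_loss` callables, and the weight `lam` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def set_trainable(module, flag):
    """Freeze or unfreeze a module's parameters. Callers apply this once
    before each stage, e.g. set_trainable(encoder, False) before stage 2."""
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 (warm-up): encoder, generator, and decoder all train with the
# recognition losses. `ctc_loss` and `ce_loss` are assumed callables that
# already handle length bookkeeping.
def warmup_step(encoder, generator, decoder, batch, ctc_loss, ce_loss):
    feats = generator(encoder(batch["video"]))
    ctc_logits, ce_logits = decoder(feats)      # assumed dual-headed decoder
    return ctc_loss(ctc_logits, batch["text"]) + ce_loss(ce_logits, batch["text"])

# Stage 2 (enhancement): encoder and decoder are frozen; the generator and
# discriminator train with an adversarial loss plus a distance loss between
# predicted and real ASR features.
def enhancement_step(encoder, generator, discriminator, asr_encoder, batch,
                     distance_loss, lam=1.0):
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    with torch.no_grad():                        # frozen feature extractors
        video_emb = encoder(batch["video"])
        audio_emb = asr_encoder(batch["audio"])  # pre-trained ASR features
    pred = generator(video_emb)                  # predicted audio-like features

    # Generator: fool the discriminator and stay close to the real ASR features.
    g_score = discriminator(pred)
    g_loss = bce(g_score, torch.ones_like(g_score)) \
           + lam * distance_loss(pred, audio_emb)

    # Discriminator: separate predicted features from real ASR features.
    fake_score = discriminator(pred.detach())
    real_score = discriminator(audio_emb)
    d_loss = bce(fake_score, torch.zeros_like(fake_score)) \
           + bce(real_score, torch.ones_like(real_score))
    return g_loss, d_loss

# Stage 3 (refinement): everything except the decoder is frozen; the decoder
# is fine-tuned on the (now better aligned) generated features.
def refinement_step(encoder, generator, decoder, batch, ce_loss):
    with torch.no_grad():
        feats = generator(encoder(batch["video"]))
    _, ce_logits = decoder(feats)
    return ce_loss(ce_logits, batch["text"])
```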
Experiments on the CMLR dataset show that the JEP-KD framework significantly improves the performance of VSR models, reducing the character error rate (CER) from 19.92% to 14.26%. Further pre-training on additional datasets leads to a CER of 11.97%, demonstrating the versatility of the approach. However, the authors note that there is still a substantial gap compared to ASR models, suggesting room for further research to enhance the predictive capabilities of the model.
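For reference, the character error rate quoted above is the character-level edit distance between hypothesis and reference divided by the reference length. A minimal sketch, not tied to the paper's evaluation code:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance over characters (single rolling row).
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            prev, dist[j] = dist[j], min(dist[j] + 1,        # deletion
                                         dist[j - 1] + 1,    # insertion
                                         prev + (r != h))    # substitution
    return dist[-1] / max(len(ref), 1)

print(cer("知识蒸馏", "知识蒸溜"))  # one substitution over four characters -> 0.25
```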
Statistics
The CMLR dataset contains 102,072 spoken sentences, including 71,448 in the training set, 20,418 in the test set, and 10,206 in the validation set.
The ASR model (WeNet) achieves around 2% CER on the CMLR dataset without fine-tuning.
Quotes
"Utilizing trained ASR models to conduct knowledge distillation is a widely acknowledged and efficacious strategy to enhance performance."
"We hold that the semantic gaps manifesting in video modalities relative to audio ones adhere to a systematic pattern — that is, these consistent omissions are correlated with sentence content and are, therefore, predictable."
"Considering the considerable performance gap that still exists between VSR and ASR, enhancing the capabilities of VSR as much as possible remains the most important research objective at the current stage."