The paper proposes JEP-KD (Joint-Embedding Predictive Architecture-based Knowledge Distillation), a novel knowledge distillation framework for visual speech recognition (VSR). The key innovation is a generative network inserted at the embedding layer, which strengthens the video encoder's ability to extract semantic features and aligns them more closely with the audio features of a pre-trained ASR model.
The authors argue that the prevalent knowledge distillation methods for VSR, which rely on rigid alignment of video and audio features, are suboptimal due to the inherent semantic limitations of the video modality. They hypothesize that the semantic gaps between video and audio are systematic and predictable, and can be addressed by the proposed predictive framework.
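Schematically, the contrast can be written as follows (the notation and loss forms here are illustrative, not the paper's exact formulation): conventional distillation forces the video embedding f_v directly onto the audio embedding f_a, while JEP-KD interposes a generator G that predicts f_a from f_v, letting G absorb the systematic part of the modality gap.

```latex
% Rigid alignment (conventional KD): video features must match audio features directly
\mathcal{L}_{\text{align}} = \lVert f_v - f_a \rVert_2^2
% Predictive alignment (JEP-KD, schematic): a generator G bridges the modality gap
\mathcal{L}_{\text{pred}} = \lVert G(f_v) - f_a \rVert_2^2
```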
The JEP-KD architecture consists of four main components: the video encoder, the generator, the discriminator, and the decoder. Training proceeds in three stages; a schematic sketch of the core distillation step is given below.
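For concreteness, here is a minimal PyTorch sketch of one such step under stated assumptions: the module definitions, dimensions, helper name `distillation_step`, and loss weights are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; the paper's actual modules are far larger.
class Generator(nn.Module):
    """Predicts audio-like embeddings from video embeddings (the JEPA-style predictor)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.net(v)

class Discriminator(nn.Module):
    """Scores whether an embedding came from the ASR encoder or the generator."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, 1))

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        return self.net(e)

def distillation_step(video_encoder, generator, discriminator, decoder,
                      video, audio_emb, targets):
    """One schematic training step; loss weights are illustrative, not the paper's."""
    mse = nn.MSELoss()
    bce = nn.BCEWithLogitsLoss()
    ce = nn.CrossEntropyLoss()

    f_v = video_encoder(video)      # (B, T, D) video embeddings
    f_hat = generator(f_v)          # predicted audio-like embeddings
    # 1) Predictive alignment toward the frozen ASR encoder's features.
    loss_pred = mse(f_hat, audio_emb.detach())
    # 2) Adversarial term: the generator tries to fool the discriminator.
    d_out = discriminator(f_hat)
    loss_adv = bce(d_out, torch.ones_like(d_out))
    # 3) Task loss through the decoder (character-level cross-entropy).
    logits = decoder(f_hat)         # (B, T, vocab)
    loss_task = ce(logits.transpose(1, 2), targets)
    return loss_pred + 0.1 * loss_adv + loss_task
```

A complementary discriminator update (real ASR embeddings versus generated ones) would alternate with this step in standard adversarial fashion, and the staged training schedule would presumably freeze or unfreeze these modules accordingly.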
Experiments on the CMLR dataset show that the JEP-KD framework significantly improves VSR performance, reducing the character error rate (CER) from 19.92% to 14.26%. Pre-training on additional datasets lowers the CER further to 11.97%, demonstrating that the framework generalizes across training corpora. However, the authors note that a substantial gap to ASR models remains, suggesting room for further work on the model's predictive capabilities.
Source: Chang Sun, Ho..., arxiv.org, 03-29-2024, https://arxiv.org/pdf/2403.18843.pdf