JEP-KD: A Joint-Embedding Predictive Architecture for Enhancing Visual Speech Recognition through Knowledge Distillation
The paper introduces JEP-KD, a joint-embedding predictive architecture that adds a generative network within the embedding layer to strengthen the video encoder's capacity for semantic feature extraction and to align its output more closely with audio features from a pre-trained ASR model. The approach aims to progressively narrow the performance gap between visual speech recognition (VSR) and automatic speech recognition (ASR).
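The core idea described above can be sketched in code: a student video encoder produces embeddings, a generative predictor maps them toward the teacher's audio-embedding space, and a distillation loss penalizes the mismatch. This is a minimal illustration only; the module sizes, the cosine objective, and all class and function names here are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Stand-in student video encoder: lip-region features -> per-frame embeddings."""
    def __init__(self, in_dim=96, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.GELU(),
                                 nn.Linear(emb_dim, emb_dim))

    def forward(self, x):  # x: (batch, frames, in_dim)
        return self.net(x)

class Predictor(nn.Module):
    """Generative network in the embedding layer: maps video embeddings
    toward the teacher ASR model's audio embedding space."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.GELU(),
                                 nn.Linear(emb_dim, emb_dim))

    def forward(self, z):
        return self.net(z)

def jep_kd_loss(video_frames, audio_emb, encoder, predictor):
    """Distillation loss aligning predicted embeddings with the frozen
    teacher's audio embeddings (hypothetical cosine formulation)."""
    z_video = encoder(video_frames)
    z_pred = predictor(z_video)
    # The target comes from the pre-trained ASR model; detaching it ensures
    # gradients flow only through the student video branch.
    return 1.0 - F.cosine_similarity(z_pred, audio_emb.detach(), dim=-1).mean()

encoder, predictor = VideoEncoder(), Predictor()
video = torch.randn(2, 10, 96)    # (batch, frames, lip-feature dim)
audio = torch.randn(2, 10, 256)   # teacher ASR embeddings, same frame count
loss = jep_kd_loss(video, audio, encoder, predictor)
```

Minimizing this loss trains both the encoder and the predictor while the teacher stays fixed, which is the standard knowledge-distillation setup the paper builds on.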