Learning to Bootstrap (L2B) is a simple, effective method that lets models bootstrap themselves from their own predictions without being misled by erroneous pseudo-labels: through meta-learning, it dynamically adjusts the importance weights between the observed labels and the model-generated pseudo-labels, and also reweights individual samples.
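To make the bilevel idea concrete, here is a minimal PyTorch sketch of one meta step in the style of L2B: per-term weights `eps` (initialized to zero, as in learning-to-reweight schemes) scale the observed-label and pseudo-label losses, a virtual SGD step is taken through them, and the validation loss of the virtually updated model yields rectified weights. The function name, `lr`, and the clean batch `(x_val, y_val)` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call


def l2b_meta_weights(model, x, y_obs, x_val, y_val, lr=0.1):
    """Simplified L2B-style meta step; returns per-sample weights of shape
    (batch, 2) over the observed-label and pseudo-label loss terms."""
    params = {k: v.detach().requires_grad_(True)
              for k, v in model.named_parameters()}
    logits = functional_call(model, params, (x,))
    y_pseudo = logits.detach().argmax(dim=1)  # bootstrap targets: own predictions

    # Candidate weights, zero-initialized so the virtual step isolates
    # each term's influence on the meta objective.
    eps = torch.zeros(x.size(0), 2, requires_grad=True)
    loss_obs = F.cross_entropy(logits, y_obs, reduction="none")
    loss_pse = F.cross_entropy(logits, y_pseudo, reduction="none")
    train_loss = (eps[:, 0] * loss_obs + eps[:, 1] * loss_pse).mean()

    # Virtual SGD step, keeping the graph so eps receives gradients.
    grads = torch.autograd.grad(train_loss, list(params.values()),
                                create_graph=True)
    virtual = {k: v - lr * g for (k, v), g in zip(params.items(), grads)}

    # Meta objective: loss of the virtually updated model on clean data.
    val_loss = F.cross_entropy(functional_call(model, virtual, (x_val,)), y_val)
    eps_grad = torch.autograd.grad(val_loss, eps)[0]

    # Terms whose upweighting would lower the validation loss get larger
    # weights; negative contributions are clamped away, then normalized.
    w = torch.clamp(-eps_grad, min=0)
    return w / (w.sum() + 1e-8)
```

In use, the returned weights would reweight the same two loss terms in the actual training step, so samples whose observed labels appear noisy lean more on the model's own predictions.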
The paper introduces JEP-KD, a knowledge-distillation approach built on a joint-embedding predictive architecture: a generative network inserted at the embedding layer strengthens the video encoder's semantic feature extraction and aligns its output with audio features from a pre-trained ASR model, with the goal of progressively closing the performance gap between visual speech recognition (VSR) and automatic speech recognition (ASR).
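Below is a minimal PyTorch sketch of the alignment idea: a small predictor maps video-encoder features into the embedding space of a frozen ASR audio encoder, and an alignment loss pulls the predicted embeddings toward the teacher's. The class name, the dimensions, and the MLP standing in for the paper's generative network are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class JEPKDAlignment(nn.Module):
    """Sketch of a JEP-KD-style embedding alignment head.

    The predictor plays the role of the generative network at the
    embedding layer; the frozen ASR encoder supplies target embeddings.
    """

    def __init__(self, video_dim=512, audio_dim=768, hidden_dim=1024):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(video_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, audio_dim),
        )

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, T, video_dim) from the trainable video encoder.
        # audio_feats: (B, T, audio_dim) from the frozen, pre-trained ASR
        # encoder; detached so no gradient flows into the teacher.
        pred = self.predictor(video_feats)
        return nn.functional.smooth_l1_loss(pred, audio_feats.detach())
```

This alignment loss would be added to the usual VSR training objective, so the video encoder is pushed toward the richer semantic space the ASR teacher already occupies.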