Whisper-Flamingo, a novel model integrating visual features from AV-HuBERT into the Whisper model using gated cross attention, achieves state-of-the-art performance in both audio-visual speech recognition and translation, demonstrating significant improvements in noisy conditions.
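The gating idea behind such cross-attention fusion can be illustrated with a minimal NumPy sketch. This is not Whisper-Flamingo's actual implementation; the function names, single-head attention, and zero-initialized scalar gate are illustrative assumptions. The key property shown is that a tanh gate initialized at zero leaves the pretrained model's output untouched at the start of training, so visual information is blended in gradually:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, wq, wk, wv):
    # queries: (Tq, d) decoder states; keys_values: (Tv, d) visual features
    q = queries @ wq
    k = keys_values @ wk
    v = keys_values @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_cross_attention(x, visual, wq, wk, wv, gate):
    # tanh(gate) scales the visual contribution; gate == 0 gives the identity,
    # preserving the pretrained (audio-only) model's behavior at initialization
    return x + np.tanh(gate) * cross_attention(x, visual, wq, wk, wv)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))        # hypothetical decoder states
visual = rng.standard_normal((6, d))   # hypothetical visual features
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

out_init = gated_cross_attention(x, visual, wq, wk, wv, gate=0.0)
out_open = gated_cross_attention(x, visual, wq, wk, wv, gate=1.0)
```

With `gate=0.0` the block is a no-op (`out_init` equals `x`); a learned nonzero gate lets visual features flow into the decoder.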
AlignVSR, a novel method for visual speech recognition (VSR), leverages audio information to substantially improve lip-reading accuracy by aligning the audio and visual modalities through a two-layer alignment mechanism.
The proposed DCIM-AVSR model introduces an efficient asymmetric architecture that prioritizes the audio modality while treating the visual modality as supplementary, enabling more effective integration of multi-modal information through the Dual Conformer Interaction Module (DCIM).
BRAVEn, an extension to the RAVEn method, learns strong visual and auditory speech representations entirely from raw audio-visual data, achieving state-of-the-art performance among self-supervised methods in various settings.