BRAVEn, an extension of the RAVEn method, learns strong visual and auditory speech representations entirely from raw audio-visual data, achieving state-of-the-art performance among self-supervised methods across a range of settings.
The proposed DCIM-AVSR model introduces an efficient asymmetric architecture that prioritizes the audio modality while treating the visual modality as supplementary, enabling more effective integration of multi-modal information through the Dual Conformer Interaction Module (DCIM).
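The exact internals of the DCIM are not detailed here, but the asymmetric idea — the audio stream is primary and the visual stream only supplements it — can be illustrated with one common realization: cross-attention in which audio queries attend to visual keys/values, with the result added residually to the audio stream alone. The sketch below is a minimal numpy illustration under that assumption; all function and variable names are hypothetical and do not come from the DCIM-AVSR paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d):
    # query: audio features (T_a, d); key_value: visual features (T_v, d).
    # Each audio frame gathers a weighted summary of the visual frames.
    scores = query @ key_value.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ key_value

rng = np.random.default_rng(0)
d = 16
audio = rng.standard_normal((20, d))   # primary stream (e.g. conformer output)
visual = rng.standard_normal((5, d))   # supplementary stream

# Asymmetric fusion: only the audio stream is updated with visual context;
# the visual stream is never enriched with audio, reflecting its
# supplementary role.
fused_audio = audio + cross_attention(audio, visual, d)
print(fused_audio.shape)  # (20, 16)
```

The asymmetry is the key design choice: a symmetric design would run a second cross-attention updating the visual stream from audio, whereas here compute and capacity are concentrated on the modality that carries most of the speech information.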