Improving Audio-Visual Speech Recognition with Lip-Subword Correlation
The author proposes two techniques to improve audio-visual speech recognition: correlating lip shapes with syllable-level subword units, and an audio-guided Cross-Modal Fusion Encoder. Both aim to improve the alignment between the video and audio streams and to make effective use of the complementarity between the two modalities.
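To make the idea of audio-guided fusion concrete, the following is a minimal sketch, not the paper's architecture, which the summary does not specify. It assumes the fusion can be approximated by standard scaled dot-product cross-attention in which audio frames act as queries and video frames act as keys and values, followed by a residual connection. The function name `audio_guided_cross_attention` and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_cross_attention(audio, video):
    """Fuse video features into audio features (illustrative assumption).

    audio: list of T_a feature vectors (queries), each of dimension d
    video: list of T_v feature vectors (keys and values), each of dimension d
    Returns (fused, weights): fused has T_a vectors of dimension d,
    weights has T_a rows of T_v attention weights.
    """
    d = len(audio[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in audio:
        # Similarity between one audio frame and every video frame.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in video]
        w = softmax(scores)
        # Attention-weighted sum of the video frames.
        context = [sum(wj * v[i] for wj, v in zip(w, video)) for i in range(d)]
        # Residual connection: the audio stream stays the primary signal.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return fused, weights


# Toy example: 2 audio frames, 3 video frames, feature dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = audio_guided_cross_attention(audio, video)
```

Because the audio frames form the queries, the output has one fused vector per audio frame, so the two streams do not need the same frame rate. Each row of `weights` shows which video frames a given audio frame attends to, which is one simple way to read the audio-video alignment.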