Enhancing Audio-Visual Wake Word Spotting with Frame-Level Cross-Modal Attention
The authors propose a Frame-Level Cross-Modal Attention (FLCMA) module to improve the performance of Audio-Visual Wake Word Spotting systems by modeling multi-modal information at the frame level. The approach trains an end-to-end FLCMA-based Audio-Visual Conformer and further fine-tunes pre-trained uni-modal models, achieving a new state-of-the-art result on the MISP dataset.
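To make the idea of frame-level cross-modal attention concrete, the sketch below shows one plausible way such a module could be wired up, assuming audio and visual features have already been projected to a common dimension and aligned to the same frame rate. The class name, dimensions, and use of residual connections are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class FrameLevelCrossModalAttention(nn.Module):
    """Illustrative sketch: each modality attends to the other at every
    (synchronized) frame, so audio frames can borrow lip-movement cues
    and visual frames can borrow acoustic cues."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # audio queries attend over visual keys/values, and vice versa
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, frames, dim), frame rates already aligned
        a_att, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v_att, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        # residual connections keep the original uni-modal information
        audio = self.norm_a(audio + a_att)
        visual = self.norm_v(visual + v_att)
        return audio, visual


if __name__ == "__main__":
    fusion = FrameLevelCrossModalAttention(dim=256, num_heads=4)
    a = torch.randn(2, 100, 256)  # 100 audio frames
    v = torch.randn(2, 100, 256)  # 100 visual frames (upsampled to match)
    a_out, v_out = fusion(a, v)
    print(a_out.shape, v_out.shape)  # torch.Size([2, 100, 256]) for both
```

In a full system, the fused frame-level features would then feed into Conformer blocks and a wake-word classification head; this snippet only illustrates the cross-modal fusion step itself.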