Enhancing Lip Reading Performance with Multi-Scale Video Data and Multi-Encoder Architectures
The authors propose improving automatic lip reading (ALR) by combining multi-scale video data with multi-encoder architectures, including the recently introduced Branchformer and E-Branchformer encoders. Their method achieves state-of-the-art results on Task 2 of the ICME 2024 ChatCLR Challenge.
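
To make the idea of multi-scale video data concrete, here is a minimal sketch of one way to build several views of the same clip at different spatial scales. It is not the authors' pipeline: the function names, the crop sizes (48, 64, 96), the shared output resolution (32), centre cropping, and nearest-neighbour resizing are all assumptions chosen for illustration. A real system would more likely crop around detected lip landmarks and use a proper image library for resizing.

```python
# Illustrative only: names, crop sizes, and output resolution are hypothetical.

def center_crop(frame, size):
    """Return the size x size window around the centre of a 2-D frame (list of rows)."""
    h, w = len(frame), len(frame[0])
    if size > h or size > w:
        raise ValueError("crop size exceeds frame size")
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]


def resize_nearest(frame, out_size):
    """Nearest-neighbour resize of a square frame to out_size x out_size."""
    s = len(frame)
    # Map each output index back to a source index.
    idx = [i * s // out_size for i in range(out_size)]
    return [[frame[r][c] for c in idx] for r in idx]


def multi_scale_clips(clip, crop_sizes=(48, 64, 96), out_size=32):
    """Map each crop size to the clip cropped at that scale and resized to a shared resolution."""
    return {
        size: [resize_nearest(center_crop(f, size), out_size) for f in clip]
        for size in crop_sizes
    }


if __name__ == "__main__":
    # A dummy 3-frame grayscale clip of 112 x 112 pixels.
    clip = [[[0] * 112 for _ in range(112)] for _ in range(3)]
    views = multi_scale_clips(clip)
    for size, frames in views.items():
        print(size, len(frames), len(frames[0]), len(frames[0][0]))
```

Each entry in the returned dictionary shows the mouth region with a different amount of surrounding context. Under this sketch's assumptions, each view could be fed to its own encoder or to a shared one.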