AdaptSign, a novel strategy, efficiently adapts large vision-language models such as CLIP to continuous sign language recognition by introducing lightweight modules that inject domain-specific knowledge while preserving the generalizability of the pretrained model.
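The idea of adapting a frozen pretrained backbone through small inserted modules can be illustrated with a minimal PyTorch sketch. This is not AdaptSign's actual architecture: the bottleneck adapter, the stand-in transformer block, and all dimensions are assumptions chosen for illustration. The zero-initialized up-projection makes the adapter start as an identity map, so pretrained features pass through unchanged at the beginning of training.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Hypothetical sketch; module names and sizes are not from the paper."""
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity: output == input initially
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for one frozen block of a pretrained vision-language model.
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
backbone.eval()  # deterministic (disables dropout)
for p in backbone.parameters():
    p.requires_grad = False  # pretrained weights stay fixed

adapter = Adapter()  # only these few parameters are trained
x = torch.randn(2, 16, 512)          # (batch, frames, feature dim)
y = adapter(backbone(x))
trainable = sum(p.numel() for p in adapter.parameters())
```

Only the adapter's roughly 66k parameters receive gradients, versus millions in the frozen block, which is what makes this style of adaptation lightweight.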
The proposed Denoising-Diffusion Alignment (DDA) method uses a diffusion-based denoising process to align video and gloss sequences at the global level, capturing long-range temporal context and improving continuous sign language recognition performance.
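A generic diffusion-style training step for conditioning video features on gloss features can be sketched as follows. This is a standard DDPM-like objective with a toy conditional denoiser, written only to illustrate the general mechanism; the denoiser architecture, the noise schedule value, and the assumption that gloss features are upsampled to frame length are all illustrative, not DDA's actual design.

```python
import torch
import torch.nn as nn

class GlossConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to video features,
    conditioned on gloss features (hypothetical architecture)."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim)
        )

    def forward(self, noisy_video, gloss):
        return self.net(torch.cat([noisy_video, gloss], dim=-1))

denoiser = GlossConditionedDenoiser(dim=32)
video = torch.randn(2, 16, 32)  # (batch, frames, dim) video features
gloss = torch.randn(2, 16, 32)  # gloss features at frame length (assumed)

alpha_bar = torch.tensor(0.7)   # cumulative noise level at a sampled timestep
noise = torch.randn_like(video)
# DDPM-style forward noising of the video features.
noisy = alpha_bar.sqrt() * video + (1 - alpha_bar).sqrt() * noise
# Denoising objective: recover the noise given the gloss condition, which
# forces the model to relate the two sequences globally.
loss = nn.functional.mse_loss(denoiser(noisy, gloss), noise)
loss.backward()
```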
This study constructs a large-scale Chinese continuous sign language dataset (CE-CSL) oriented towards practical application environments, featuring diverse real-world backgrounds, and proposes a Time-Frequency Network (TFNet) model that achieves efficient and accurate continuous sign language recognition by extracting sequence features in both the temporal and frequency domains.
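Combining temporal- and frequency-domain sequence features can be sketched with a small PyTorch module. This is a minimal illustration, not TFNet itself: the temporal convolution branch, the crude low-pass filtering of the frame-axis spectrum, and all dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class TimeFreqBlock(nn.Module):
    """Hypothetical sketch: one branch extracts local temporal features,
    the other processes the sequence in the frequency domain."""
    def __init__(self, dim=32):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.freq_proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, frames, dim)
        # Temporal branch: 1-D convolution over the frame axis.
        t = self.temporal(x.transpose(1, 2)).transpose(1, 2)
        # Frequency branch: FFT over frames, keep low-frequency bins
        # (slow motion components), transform back, then project.
        spec = torch.fft.rfft(x, dim=1)
        keep = spec.size(1) // 2 + 1
        mask = (torch.arange(spec.size(1)) < keep).view(1, -1, 1)
        low = torch.fft.irfft(spec * mask, n=x.size(1), dim=1)
        f = self.freq_proj(low)
        return t + f

block = TimeFreqBlock(dim=32)
video_feats = torch.randn(2, 16, 32)  # (batch, frames, feature dim)
out = block(video_feats)
```

The frequency branch here acts as a learned smoothing path: dropping high-frequency bins suppresses frame-to-frame jitter, while the convolutional branch keeps fine-grained local dynamics.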