The paper proposes GazeCLIP, a novel gaze estimation framework that exploits the synergy between text and image features. The key contributions are:
1. Introducing a text-guided gaze estimation approach that leverages the rich language-vision knowledge of the pre-trained CLIP model; this is the first attempt to use large-scale language supervision to enhance gaze estimation (see the prompt-encoding sketch after this list).
2. Designing a cross-attention mechanism that finely aligns image features with the semantic guidance embedded in the textual signals, yielding more discriminative gaze representations.
3. Extensive experiments on three challenging datasets demonstrating that GazeCLIP achieves state-of-the-art performance, with an average 9.3% reduction in angular error over previous methods.
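As a concrete illustration of the text-guidance idea, the sketch below encodes a handful of gaze-direction prompts with a pre-trained CLIP text encoder via the Hugging Face transformers API. The prompt wording and the model checkpoint are illustrative assumptions; the paper's exact templates may differ.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Hypothetical gaze-direction prompts; the paper's exact templates may differ.
prompts = [
    "a photo of a face gazing to the left",
    "a photo of a face gazing to the right",
    "a photo of a face gazing upward",
    "a photo of a face gazing downward",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")

with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    text_emb = model.get_text_features(**tokens)  # (4, 512) semantic embeddings

print(text_emb.shape)
```

These frozen text embeddings can then serve as the semantic guidance that the image features are aligned with.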
The paper first provides an overview of the evolution of gaze estimation techniques, highlighting the limitations of existing appearance-based approaches that solely rely on facial/eye images. It then introduces the GazeCLIP framework, which consists of a CLIP-based image encoder, a text encoder, and a cross-attention fusion module.
The image encoder extracts visual features from the input face image, while the text encoder generates semantic embeddings from predefined language prompts that describe the gaze direction. The cross-attention mechanism is designed to align the image and text features, producing refined image representations that capture the nuanced gaze semantics conveyed by the prompts. Finally, the enhanced image embeddings are fed into a regression head to predict the gaze direction.
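To make the fusion step concrete, here is a minimal PyTorch sketch of a cross-attention module in the spirit described above: visual tokens act as queries over the text embeddings, and a small regression head predicts pitch and yaw. The module name, dimensions, pooling, and residual choices are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TextGuidedGazeHead(nn.Module):
    """Illustrative cross-attention fusion: image tokens attend to text embeddings
    (query = image, key/value = text), followed by a gaze regression head."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.regressor = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2)  # (pitch, yaw)
        )

    def forward(self, img_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, dim) visual tokens from the CLIP image encoder
        # text_emb:   (B, M, dim) embeddings of the gaze-direction prompts
        fused, _ = self.attn(query=img_tokens, key=text_emb, value=text_emb)
        fused = self.norm(fused + img_tokens)   # residual connection
        pooled = fused.mean(dim=1)              # pool over visual tokens
        return self.regressor(pooled)           # predicted gaze angles, shape (B, 2)

# Example with random tensors standing in for the encoder outputs:
head = TextGuidedGazeHead()
gaze = head(torch.randn(8, 197, 512), torch.randn(8, 4, 512))
print(gaze.shape)  # torch.Size([8, 2])
```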
The authors conduct extensive experiments on three benchmark datasets (MPIIFaceGaze, RT-GENE, and EyeDiap) and demonstrate that GazeCLIP outperforms state-of-the-art gaze estimation methods by a significant margin. The paper also includes detailed ablation studies that validate the effectiveness of the key components of the proposed framework.
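For reference, the angular error reported on these benchmarks is typically the angle between the predicted and ground-truth 3D gaze vectors, averaged over test samples. The sketch below computes it from (pitch, yaw) outputs under one common sign convention; the paper's exact convention may differ.

```python
import numpy as np

def pitch_yaw_to_vector(pitchyaw: np.ndarray) -> np.ndarray:
    """Convert (pitch, yaw) angles in radians to unit 3D gaze vectors
    (one common convention; other sign conventions exist)."""
    pitch, yaw = pitchyaw[:, 0], pitchyaw[:, 1]
    return np.stack([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ], axis=1)

def mean_angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean angle (degrees) between predicted and ground-truth gaze vectors."""
    a, b = pitch_yaw_to_vector(pred), pitch_yaw_to_vector(gt)
    cos_sim = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_sim)).mean())

# Example: two predictions versus ground truth, angles in radians.
print(mean_angular_error_deg(np.array([[0.1, 0.2], [0.0, -0.1]]),
                             np.array([[0.12, 0.18], [0.02, -0.09]])))
```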
Key insights extracted from the paper by Jun Wang, Hao... at arxiv.org, 04-29-2024
https://arxiv.org/pdf/2401.00260.pdf