Core Concept
Vision Transformers enable a novel single-modality gaze following framework, ViTGaze, achieving state-of-the-art performance in predicting human gaze targets.
Summary
ViTGaze introduces a single-modality approach to gaze following built on Vision Transformers. It extracts human-scene interaction information from the self-attention maps of a pre-trained plain ViT, combining a 4D interaction encoder with a 2D spatial guidance module. With fewer parameters than prior methods, ViTGaze achieves state-of-the-art performance, with significant improvements in AUC and AP.
Introduction
Gaze following predicts the target a person is looking at within an image. Previous methods rely on multi-modality frameworks or complex query-based decoders, whereas ViTGaze uses a single modality.
Method
ViTGaze builds on pre-trained plain Vision Transformers for gaze prediction.
A 4D interaction encoder extracts human-scene interaction features from self-attention maps, and a 2D spatial guidance module focuses these features on the person whose gaze is being predicted.
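The core idea above can be illustrated with a minimal numpy sketch: the self-attention scores between all pairs of patch tokens form a 4D tensor over spatial positions, and a head position (the "2D spatial guidance") selects the 2D attention slice that serves as a gaze interaction map. The function names, random stand-in projection weights, and head-position indexing here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interaction_maps(tokens, num_heads, h, w, rng):
    """Toy multi-head self-attention over h*w patch tokens.

    tokens: (N, C) patch embeddings with N = h*w.
    Returns a 4D interaction tensor of shape (heads, h, w, h, w):
    attention between every pair of spatial positions.
    """
    N, C = tokens.shape
    d = C // num_heads
    # random projections standing in for the ViT's learned Q/K weights
    Wq = rng.standard_normal((C, C)) / np.sqrt(C)
    Wk = rng.standard_normal((C, C)) / np.sqrt(C)
    q = (tokens @ Wq).reshape(N, num_heads, d).transpose(1, 0, 2)
    k = (tokens @ Wk).reshape(N, num_heads, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, N, N)
    return attn.reshape(num_heads, h, w, h, w)

def gaze_heatmap(interaction, head_pos):
    """2D spatial guidance (hypothetical): pick the attention rows of the
    token at the person's head location, averaged over heads -> (h, w)."""
    hy, hx = head_pos
    return interaction[:, hy, hx].mean(axis=0)
```

Because each attention row is softmax-normalized, the resulting heatmap sums to one and can be read as a distribution over candidate gaze targets; the real model refines such maps with learned decoding rather than using raw attention directly.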
Experiment
Evaluation on GazeFollow and VideoAttentionTarget datasets shows ViTGaze outperforms previous methods.
Ablation studies confirm the effectiveness of multi-level 4D features and 2D spatial guidance.
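Since the reported gains are in AUC and AP, a short sketch of how a heatmap AUC can be computed may help. This is a generic ROC-AUC over a predicted heatmap versus a binary ground-truth mask (pairwise Mann-Whitney formulation with tie correction), not necessarily the exact GazeFollow evaluation protocol.

```python
import numpy as np

def heatmap_auc(pred, gt_mask):
    """ROC-AUC of a predicted gaze heatmap against a binary target mask.

    Counts, over all (positive, negative) pixel pairs, how often the
    positive pixel scores higher; ties contribute 0.5.
    """
    scores = np.asarray(pred, dtype=float).ravel()
    labels = np.asarray(gt_mask).ravel().astype(bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

A heatmap that ranks every target pixel above every non-target pixel scores 1.0; a constant heatmap scores 0.5 (chance level).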
Conclusion
ViTGaze presents an innovative approach to gaze following, achieving high accuracy with efficient parameter usage.
Key Statistics
"Our method achieves state-of-the-art (SOTA) performance among all single-modality methods."
"Our method gets a 3.4% improvement on AUC and 5.1% improvement on AP among single-modality methods."