Core Concepts
ViTGaze is a novel single-modality gaze-following framework built on plain Vision Transformers that achieves state-of-the-art performance in predicting human gaze targets.
Summary
ViTGaze introduces a new single-modality approach to gaze following built on pre-trained plain Vision Transformers. Its key insight is that human-scene interaction can be extracted directly from the encoder's self-attention maps. The framework consists of a 4D interaction encoder and a 2D spatial guidance module. Despite using fewer parameters than prior methods, ViTGaze achieves state-of-the-art results, with significant improvements in the AUC and AP metrics.
Introduction
- Gaze following predicts a person's gaze target in an image.
- Previous methods rely on multi-modality frameworks or query-based decoders, adding input and architectural complexity.
Method
- ViTGaze builds on pre-trained plain Vision Transformers, extracting human-scene interaction directly from their self-attention maps.
- Multi-level 4D interaction features are produced by a 4D interaction encoder and combined with 2D spatial guidance indicating the head position (see the sketch after this list).
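To make the mechanism above concrete, here is a minimal PyTorch sketch of one way patch-token self-attention can be reshaped into 4D interaction features and queried with a 2D head position. The tensor shapes, the `attention_to_4d` helper, and the dummy inputs are illustrative assumptions, not the authors' implementation.

```python
import torch

def attention_to_4d(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Reshape patch-token self-attention (num_heads, N, N), with N = h * w,
    into a (num_heads, h, w, h, w) tensor of pairwise spatial interactions."""
    num_heads, n, _ = attn.shape
    assert n == h * w, "token count must match the patch grid"
    return attn.reshape(num_heads, h, w, h, w)

# Dummy attention over a 14x14 patch grid (e.g. a 224px image, 16px patches);
# real maps would come from a pre-trained ViT's attention layers.
attn = torch.softmax(torch.randn(6, 14 * 14, 14 * 14), dim=-1)
feat4d = attention_to_4d(attn, 14, 14)

# 2D spatial guidance in spirit: given an (assumed) head-patch location,
# read off each head's interaction map over the whole scene.
hi, wi = 3, 7
scene_maps = feat4d[:, hi, wi]  # shape (6, 14, 14)
print(scene_maps.shape)
```

Indexing the 4D tensor at the head position yields per-head 2D maps over the scene, which is one plausible reading of how spatial guidance selects interaction features.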
Experiment
- Evaluation on the GazeFollow and VideoAttentionTarget datasets shows ViTGaze outperforms previous single-modality methods (a simplified metric sketch follows this list).
- Ablation studies confirm the effectiveness of multi-level 4D features and 2D spatial guidance.
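For context on the reported metrics, the sketch below shows how a GazeFollow-style AUC is typically computed: the predicted heatmap is scored against a binarized ground-truth map built from annotated gaze points (AP on VideoAttentionTarget analogously scores in-frame/out-of-frame predictions). The function name, shapes, and dummy data are assumptions for illustration, not the benchmark code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gaze_auc(pred_heatmap: np.ndarray, gaze_points) -> float:
    """Score a predicted gaze heatmap against a binarized ground-truth map
    built from annotated gaze points (hypothetical simplification)."""
    gt = np.zeros(pred_heatmap.shape, dtype=np.uint8)
    for y, x in gaze_points:
        gt[y, x] = 1  # mark each annotated gaze location as positive
    return roc_auc_score(gt.ravel(), pred_heatmap.ravel())

pred = np.random.rand(64, 64)       # dummy predicted gaze heatmap
print(gaze_auc(pred, [(10, 20)]))   # one annotated gaze point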
Conclusion
- ViTGaze offers a simple, parameter-efficient approach to gaze following that achieves state-of-the-art accuracy.
Statistics
"Our method achieves state-of-the-art (SOTA) performance among all single-modality methods."
"Our method gets a 3.4% improvement on AUC and 5.1% improvement on AP among single-modality methods."