ViTGaze is a single-modality gaze following framework built on pre-trained plain Vision Transformers that achieves state-of-the-art performance in predicting human gaze targets.