Core Concepts
A single transformer-based model, Human Attention Transformer (HAT), can effectively predict human scanpaths in both top-down visual search and bottom-up free viewing tasks, outperforming previous state-of-the-art methods.
Abstract
The paper proposes the Human Attention Transformer (HAT), a transformer-based model that predicts human scanpaths in both top-down visual search and bottom-up free viewing tasks.
Key highlights:
HAT uses a transformer-based architecture and a simplified foveated retina to capture the dynamic visual working memory of humans.
HAT avoids discretizing fixations and instead outputs a dense heatmap for each fixation, making it applicable to high-resolution inputs (a sampling sketch follows this list).
HAT establishes new state-of-the-art performance in scanpath prediction across target-present visual search, target-absent visual search, and free viewing, tasks that were previously studied separately.
HAT's attention mechanism provides high interpretability, allowing insights into human gaze behavior.
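To make the dense-heatmap highlight concrete, here is a minimal sketch of drawing the next fixation directly from a predicted per-pixel probability map rather than from a coarse grid. The function name and tensor shapes are illustrative assumptions, not taken from the paper.

```python
import torch

def sample_fixation(heatmap: torch.Tensor) -> tuple[int, int]:
    """Sample an (x, y) fixation from a dense per-pixel score map.

    `heatmap` is assumed to be an (H, W) tensor of non-negative scores,
    e.g. a fixation heatmap at the input image resolution; it is
    normalized to a probability distribution before sampling.
    """
    h, w = heatmap.shape
    probs = heatmap.flatten()
    probs = probs / probs.sum()            # normalize to a distribution
    idx = torch.multinomial(probs, 1).item()
    y, x = divmod(idx, w)                  # recover 2-D pixel coordinates
    return x, y

# Example: a 240x320 heatmap keeps fine-grained locations available,
# since no discretization onto a coarse fixation grid is required.
hm = torch.rand(240, 320)
x, y = sample_fixation(hm)
```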
The paper first formulates scanpath prediction as a sequence of dense prediction tasks. It then introduces the HAT architecture, which consists of a feature extraction module, a foveation module, an aggregation module, and a fixation prediction module.
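A minimal structural sketch of that four-module pipeline is shown below. The placeholder layers and names (HATSketch, feature_extraction, foveation, aggregation, fixation_head, termination_head) are assumptions meant only to illustrate how information flows between modules, not the paper's actual layers.

```python
import torch
import torch.nn as nn

class HATSketch(nn.Module):
    """Structural sketch of the four HAT modules described above."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Feature extraction: image -> feature maps (placeholder conv).
        self.feature_extraction = nn.Conv2d(3, feat_dim, 3, padding=1)
        # Foveation: builds the dynamic visual working memory over tokens.
        self.foveation = nn.TransformerEncoderLayer(feat_dim, nhead=8,
                                                    batch_first=True)
        # Aggregation: a task query selectively attends to the memory.
        self.aggregation = nn.MultiheadAttention(feat_dim, num_heads=8,
                                                 batch_first=True)
        # Fixation prediction: dense per-pixel logits + stop probability.
        self.fixation_head = nn.Linear(feat_dim, 1)
        self.termination_head = nn.Linear(feat_dim, 1)

    def forward(self, image, task_query):
        B, _, H, W = image.shape
        feats = self.feature_extraction(image)            # (B, C, H, W)
        tokens = feats.flatten(2).transpose(1, 2)         # (B, H*W, C)
        memory = self.foveation(tokens)                   # working memory
        pooled, _ = self.aggregation(task_query, memory, memory)  # (B, 1, C)
        heatmap_logits = self.fixation_head(memory).reshape(B, H, W)
        term_prob = torch.sigmoid(self.termination_head(pooled)).reshape(B)
        return heatmap_logits, term_prob

# Tiny usage example on a 32x32 image with a single task query vector.
model = HATSketch()
img = torch.rand(1, 3, 32, 32)
query = torch.rand(1, 1, 256)
heatmap, p_stop = model(img, query)   # heatmap: (1, 32, 32), p_stop: (1,)
```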
The foveation module constructs a dynamic working memory by combining peripheral and foveal visual information. The aggregation module selectively attends to the working memory to produce task-specific predictions. HAT is trained using behavior cloning, minimizing the fixation heatmap loss and termination probability loss.
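The two training terms can be sketched as follows. The paper's exact target-map construction and loss weighting are not reproduced here, so the cross-entropy forms and the argument names below are assumptions.

```python
import torch.nn.functional as F

def behavior_cloning_loss(heatmap_logits, term_prob, gt_fixation_map, gt_terminated):
    """Sketch of the fixation heatmap loss plus termination probability loss.

    heatmap_logits: (B, H, W) dense logits for the next fixation.
    gt_fixation_map: (B, H, W) target map derived from the human fixation
        (e.g. a normalized blob centered on the ground-truth location).
    term_prob: (B,) predicted probability that the scanpath ends here.
    gt_terminated: (B,) 1 if the human stopped after this fixation, else 0.
    """
    B, H, W = heatmap_logits.shape
    log_probs = F.log_softmax(heatmap_logits.reshape(B, -1), dim=-1)
    # Fixation heatmap loss: cross-entropy against the target map.
    fix_loss = -(gt_fixation_map.reshape(B, -1) * log_probs).sum(dim=-1).mean()
    # Termination loss: binary cross-entropy on the stop probability.
    term_loss = F.binary_cross_entropy(term_prob, gt_terminated.float())
    return fix_loss + term_loss
```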
Experiments show that HAT outperforms previous state-of-the-art methods on multiple datasets covering visual search and free viewing scenarios. Qualitative analysis also demonstrates HAT's ability to generate interpretable scanpath predictions that align with human gaze behavior.