
Unified Transformer Model for Predicting Human Attention Scanpaths in Visual Search and Free Viewing


Core Concepts
A single transformer-based model, Human Attention Transformer (HAT), can effectively predict human scanpaths in both top-down visual search and bottom-up free viewing tasks, outperforming previous state-of-the-art methods.
Abstract
The paper proposes a novel transformer-based model, the Human Attention Transformer (HAT), that predicts human scanpaths in both top-down visual search and bottom-up free viewing tasks.

Key highlights:

- HAT uses a transformer-based architecture and a simplified foveated retina to capture the dynamic visual working memory of humans.
- HAT avoids discretizing fixations and instead outputs a dense heatmap for each fixation, making it applicable to high-resolution inputs.
- HAT establishes new state-of-the-art performance in scanpath prediction across target-present visual search, target-absent visual search, and free viewing, tasks that were previously studied separately.
- HAT's attention mechanism provides high interpretability, offering insights into human gaze behavior.

The paper first formulates scanpath prediction as a sequence of dense prediction tasks. It then introduces the HAT architecture, which consists of a feature extraction module, a foveation module, an aggregation module, and a fixation prediction module. The foveation module constructs a dynamic working memory by combining peripheral and foveal visual information, and the aggregation module selectively attends to this working memory to produce task-specific predictions. HAT is trained by behavior cloning, minimizing a fixation heatmap loss and a termination probability loss. Experiments show that HAT outperforms previous state-of-the-art methods on multiple datasets covering visual search and free viewing scenarios, and qualitative analysis demonstrates that HAT generates interpretable scanpath predictions that align with human gaze behavior.
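The architecture described above lends itself to a compact sketch. Below is a minimal, illustrative PyTorch sketch of a HAT-style predictor with the four modules wired together; all module names, tensor shapes, and hyperparameters are assumptions made for illustration, not the authors' implementation.

```python
# Minimal, illustrative PyTorch sketch of a HAT-style scanpath predictor.
# All names, shapes, and hyperparameters are assumptions, not the authors' code.
import torch
import torch.nn as nn


class HATSketch(nn.Module):
    def __init__(self, feat_dim=256, n_heads=8):
        super().__init__()
        # Feature extraction: any dense backbone producing a spatial feature map
        # (a toy two-layer conv stack here).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Foveation: fuse a coarse peripheral summary with the foveal feature at
        # each past fixation into one working-memory token.
        self.foveate = nn.Linear(2 * feat_dim, feat_dim)
        # Aggregation: spatial queries cross-attend to the working memory.
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        # Fixation prediction: dense next-fixation heatmap plus a termination logit.
        self.heatmap_head = nn.Conv2d(feat_dim, 1, 1)
        self.term_head = nn.Linear(feat_dim, 1)

    def forward(self, image, fixations):
        """image: (B, 3, H, W); fixations: list of (y, x) indices on the feature map."""
        fmap = self.backbone(image)                        # (B, C, h, w)
        B, C, h, w = fmap.shape
        peripheral = fmap.mean(dim=(2, 3))                 # coarse peripheral summary, (B, C)
        # Dynamic working memory: one token per past fixation.
        tokens = [self.foveate(torch.cat([peripheral, fmap[:, :, y, x]], dim=-1))
                  for (y, x) in fixations]
        memory = torch.stack(tokens, dim=1)                # (B, T, C)
        # Every spatial location queries the working memory.
        queries = fmap.flatten(2).transpose(1, 2)          # (B, h*w, C)
        attended, attn_w = self.cross_attn(queries, memory, memory)
        attended = attended.transpose(1, 2).reshape(B, C, h, w)
        heatmap_logits = self.heatmap_head(attended)       # (B, 1, h, w)
        term_logit = self.term_head(attended.mean(dim=(2, 3)))  # (B, 1)
        return heatmap_logits, term_logit, attn_w
```

Under a behavior-cloning objective of the kind the paper describes, heatmap_logits would be compared against a human next-fixation map (e.g. with a pixel-wise cross-entropy or focal loss) and term_logit against a binary stop label via binary cross-entropy; the exact losses used by HAT are those specified in the paper.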

Deeper Inquiries

Can HAT's performance be further improved by incorporating additional modalities beyond visual input, such as semantic or contextual information?

Incorporating additional modalities beyond visual input, such as semantic or contextual information, could potentially enhance HAT's performance in scanpath prediction. By integrating semantic information about objects, scenes, or tasks, HAT could better understand the context of the visual input and make more informed predictions about where human attention is likely to be directed. For example, semantic information could help HAT prioritize certain objects or regions based on their relevance to the task at hand, leading to more accurate and contextually relevant scanpath predictions. Contextual information, such as scene layout or object relationships, could also guide HAT in predicting scanpaths that align more closely with human behavior in various scenarios. By incorporating these additional modalities, HAT may achieve a more comprehensive understanding of the visual scene and improve its ability to predict human attention allocation.
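As an illustration of this idea only (not part of HAT), one hypothetical way to add such information would be to fuse a semantic or contextual embedding into each working-memory token before aggregation, as in the sketch below, which extends the hypothetical HATSketch foveation step shown earlier.

```python
# Hypothetical extension of the HATSketch foveation step: fuse a semantic/context
# embedding (e.g. a target-category or scene vector) into each fixation token.
# This illustrates the idea discussed here; it is not part of HAT.
import torch
import torch.nn as nn


class SemanticFoveation(nn.Module):
    def __init__(self, feat_dim=256, sem_dim=64):
        super().__init__()
        self.proj = nn.Linear(2 * feat_dim + sem_dim, feat_dim)

    def forward(self, peripheral, foveal, semantic):
        """peripheral, foveal: (B, feat_dim); semantic: (B, sem_dim) context embedding."""
        return self.proj(torch.cat([peripheral, foveal, semantic], dim=-1))
```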

How would HAT's predictions differ from human scanpaths in more complex or ambiguous visual scenes where top-down and bottom-up attention control mechanisms may conflict?

In more complex or ambiguous visual scenes where top-down and bottom-up attention control mechanisms may conflict, HAT's predictions may exhibit a nuanced interplay between these two forms of attention. Top-down attention control, driven by task goals or expectations, may guide HAT to focus on specific objects or regions relevant to the task at hand. On the other hand, bottom-up attention control, influenced by salient visual features, may attract HAT's attention to unexpected or visually striking elements in the scene. In such scenarios, HAT's predictions may reflect a balance between task-driven attention and stimulus-driven attention, resulting in scanpaths that combine elements of both top-down and bottom-up processing. The conflict between these attention mechanisms could lead to more exploratory or varied scanpaths, as HAT navigates the complexity of the visual scene and the competing demands of different attentional processes.

What insights can HAT's attention mechanism provide about the underlying cognitive processes governing human visual attention allocation in different task scenarios?

HAT's attention mechanism can provide valuable insights into the cognitive processes governing human visual attention allocation in different task scenarios. By dynamically integrating spatial, temporal, and visual information at each fixation, HAT builds a spatio-temporal representation akin to human dynamic visual working memory, and it learns task-specific attention weights for combining that information and predicting human attention control. In visual search tasks, HAT's attention mechanism can shed light on how humans prioritize and shift attention based on task goals and visual cues; in free-viewing tasks, it can reveal how humans explore and allocate attention in the absence of a specific goal. By analyzing HAT's attention predictions across attention-demanding scenarios, researchers can gain insights into the cognitive processes underlying human attention allocation, including the interplay between top-down, goal-directed attention and bottom-up, stimulus-driven attention.
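As a toy illustration of this kind of analysis, the attention weights returned by the hypothetical HATSketch defined earlier could be read out per remembered fixation; this is a sketch of the idea, not HAT's own analysis code.

```python
# Toy inspection of the cross-attention weights from the HATSketch defined earlier,
# showing how per-fixation contributions to the next-fixation prediction could be read out.
import torch

model = HATSketch()
image = torch.randn(1, 3, 256, 256)
fixations = [(4, 7), (10, 3), (6, 20)]          # toy (y, x) positions on the 32x32 feature map

heatmap_logits, term_logit, attn_w = model(image, fixations)

# attn_w: (B, h*w, T) -- for every spatial location, one weight per remembered fixation.
# Averaging over locations gives a rough measure of how strongly each past fixation
# drives the prediction of the next fixation.
per_fixation = attn_w.mean(dim=1).squeeze(0)    # (T,)
for i, w in enumerate(per_fixation.tolist()):
    print(f"fixation {i}: mean attention weight {w:.3f}")
```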