Core Concepts
Sufficiently large, sufficiently trained vision transformers learn to repurpose tokens from low-information image regions to store and process global image information, which produces artifacts in their feature maps. Appending dedicated "register" tokens to the input sequence gives the model a place to isolate this behavior, resulting in smoother feature maps and improved performance on dense prediction tasks and unsupervised object discovery.
Abstract
The paper investigates artifacts observed in the feature maps of modern vision transformer models, including supervised (DeiT-III), text-supervised (OpenCLIP), and self-supervised (DINOv2) models. The authors find that these artifacts correspond to high-norm tokens that appear during inference, primarily in low-informative background areas of images, and are repurposed by the model for internal computations.
The authors propose a simple fix for this issue: appending dedicated "register" tokens to the input sequence of the Vision Transformer. These registers give the model a place to isolate its token-recycling behavior away from the patch tokens, resulting in smoother feature maps and attention maps.
The authors show that this fix removes the artifacts entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised models on dense visual prediction tasks, enables object discovery methods to work with larger models, and yields smoother feature and attention maps for downstream visual processing.
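The mechanism itself is a small architectural change. The sketch below is a minimal illustration, not the authors' implementation; the class name, module choices, and hyperparameters are assumptions. It shows learnable register tokens being appended to the token sequence before the transformer blocks and then discarded from the output, so only the [CLS] token and the patch tokens are used downstream.

```python
# Minimal sketch (not the authors' code) of a ViT with register tokens:
# learnable tokens are appended to the input sequence, take part in
# self-attention like any other token, and are discarded at the output.
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, num_patches=196, num_registers=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Extra learnable "register" tokens, analogous to the [CLS] token.
        self.register_tokens = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim), e.g. the output of a patch-embedding layer
        b = patch_tokens.shape[0]
        x = torch.cat([self.cls_token.expand(b, -1, -1), patch_tokens], dim=1)
        x = x + self.pos_embed
        # Registers carry no positional embedding; they are appended after it is added.
        x = torch.cat([x, self.register_tokens.expand(b, -1, -1)], dim=1)
        x = self.blocks(x)
        # Discard the registers at the output; keep [CLS] and the patch feature map.
        x = x[:, : x.shape[1] - self.num_registers]
        cls_out, patch_out = x[:, 0], x[:, 1:]
        return cls_out, patch_out

# Usage: a 14x14 grid of 768-d patch tokens (224x224 image, 16x16 patches).
tokens = torch.randn(2, 196, 768)
cls_out, patch_out = ViTWithRegisters()(tokens)
print(cls_out.shape, patch_out.shape)  # torch.Size([2, 768]) torch.Size([2, 196, 768])
```

Because the registers are dropped at the end, downstream heads see the same interface as a plain ViT; the model simply gains extra slots in which to perform its global computation.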
The key insights are:
Large, sufficiently trained vision transformer models learn to recognize redundant tokens and repurpose them to store and process global image information.
This behavior leads to artifacts in the feature maps, as the model discards local patch information in favor of global information.
Appending dedicated "register" tokens to the input sequence lets the model isolate this behavior, resulting in smoother feature maps and improved performance.
The authors observe that the high-norm outlier tokens appear around the middle layers of the vision transformer, and only after sufficiently long training of a sufficiently large model.
Linear probing experiments show that the outlier tokens hold less information than normal tokens about their original position in the image or the original pixels of their patch, but more global information about the image (see the probing sketch below).
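The position probe can be pictured as follows. This is a hedged sketch in the spirit of that experiment rather than the authors' protocol; the feature source, the outlier threshold, and the function name are illustrative assumptions.

```python
# Sketch of a position linear probe: train a linear classifier to predict a
# token's patch index from its embedding, then compare accuracy on normal
# vs. high-norm tokens. For brevity the probe is trained and evaluated on the
# same tokens here.
import torch
import torch.nn as nn

def position_probe_accuracy(features, positions, is_outlier, num_positions, epochs=100):
    # features: (N, dim) token embeddings; positions: (N,) patch indices in [0, num_positions)
    probe = nn.Linear(features.shape[1], num_positions)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(features), positions).backward()
        opt.step()
    with torch.no_grad():
        correct = probe(features).argmax(dim=-1) == positions
        return correct[~is_outlier].float().mean().item(), correct[is_outlier].float().mean().item()

# Toy usage with random data; real features would come from a frozen ViT backbone.
feats = torch.randn(1000, 768)
pos = torch.randint(0, 196, (1000,))
outlier = feats.norm(dim=-1) > feats.norm(dim=-1).quantile(0.98)
acc_normal, acc_outlier = position_probe_accuracy(feats, pos, outlier, num_positions=196)
print(f"normal: {acc_normal:.2f}, outlier: {acc_outlier:.2f}")
```

In the paper, lower probe accuracy on the outlier tokens is what supports the claim that they discard local position and pixel information.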
Stats
"a small fraction of the total sequence (around 2%)" of tokens have roughly 10x higher norm than the others.
These high-norm tokens appear around the middle layers of the vision transformer.
These high-norm tokens appear only after sufficiently long training of a sufficiently large model.
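These statistics suggest a simple way to surface the artifact tokens from intermediate features. The sketch below is illustrative, not the authors' detection code; the 10x-median cutoff and the function name are assumptions chosen to match the reported statistic.

```python
# Flag high-norm outlier tokens at a given layer: compute each patch token's
# L2 norm and mark tokens whose norm is far above the median for that image.
import torch

def outlier_token_mask(patch_tokens, factor=10.0):
    # patch_tokens: (batch, num_patches, dim) features from an intermediate ViT layer
    norms = patch_tokens.norm(dim=-1)                           # (batch, num_patches)
    threshold = factor * norms.median(dim=-1, keepdim=True).values
    return norms > threshold                                    # True where a token is an outlier

# Toy usage: plant a few artificially high-norm tokens and measure their fraction.
tokens = torch.randn(1, 196, 768)
tokens[0, [3, 77, 150]] *= 50.0
mask = outlier_token_mask(tokens)
print(f"outlier fraction: {mask.float().mean().item():.3f}")   # 3/196 ≈ 0.015
```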
Quotes
"the model discards the local information contained in these patches during inference."
"the model learns to recognize patches containing little useful information, and recycle the corresponding tokens to aggregate global image information while discarding spatial information."