The authors investigate the CLIP image encoder by analyzing how individual model components affect the final representation. They decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands.
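The additive structure they rely on can be sketched with synthetic tensors. This is a minimal illustration, not CLIP's actual API: the shapes, the single lumped residual term, and all variable names are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, P, D = 4, 8, 50, 512  # layers, heads per layer, image patches, embed dim

# Per-(layer, head, patch) contributions to the final representation,
# obtained by writing each attention head's output as a sum over tokens.
contrib = rng.normal(size=(L, H, P, D))

# Remaining terms (MLPs, direct residual path) lumped into one vector here.
residual = rng.normal(size=(D,))

# The image representation is the sum of all decomposed terms.
image_rep = contrib.sum(axis=(0, 1, 2)) + residual

# Marginalizing over different axes isolates per-head or per-patch effects.
per_head = contrib.sum(axis=2)        # (L, H, D): each head's total contribution
per_patch = contrib.sum(axis=(0, 1))  # (P, D): each spatial location's contribution
```

Summing any marginalization back up (plus the residual) recovers the same representation, which is what lets each term be interpreted in isolation.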
Interpreting the attention heads, the authors characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, they uncover an emergent spatial localization within CLIP.
The authors use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Their results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
The authors start by showing that the last few attention layers have the most direct effect on the final image representation. They then propose an algorithm, TEXTSPAN, that finds a basis for each attention head where each basis vector is labeled by a text description. This reveals specialized roles for each head, such as capturing shapes, colors, or locations.
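A greedy text-basis search in the spirit of TEXTSPAN can be sketched as follows. This is an illustrative reconstruction, not the paper's reference implementation: the function name, the variance-based scoring, and the projection details are assumptions.

```python
import numpy as np

def textspan(head_out, text_emb, k):
    """Greedily pick k text directions that best explain a head's outputs.

    head_out: (N, D) head outputs over a dataset of images.
    text_emb: (M, D) candidate text embeddings.
    Each round selects the candidate explaining the most remaining variance,
    then projects that direction out of both the outputs and the candidates.
    """
    A = head_out - head_out.mean(axis=0)  # center the head outputs
    T = text_emb.astype(float).copy()
    chosen = []
    for _ in range(k):
        proj = A @ T.T  # (N, M) projection of outputs onto each candidate
        scores = (proj ** 2).sum(axis=0) / (np.linalg.norm(T, axis=1) ** 2 + 1e-12)
        j = int(np.argmax(scores))
        chosen.append(j)
        u = T[j] / (np.linalg.norm(T[j]) + 1e-12)
        A -= np.outer(A @ u, u)  # remove the explained component
        T -= np.outer(T @ u, u)  # keep remaining candidates orthogonal to u
    return chosen
```

The selected candidates form a text-labeled basis for the head's output space; inspecting their descriptions is what reveals the property-specific roles.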
The authors present two applications of the identified head roles. First, they can reduce spurious correlations by removing heads associated with the spurious cue, improving worst-group accuracy on the Waterbirds dataset. Second, the representations of heads with a property-specific role can be used to retrieve images according to that property.
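Removing a head's spurious signal can be sketched as mean-ablation of its contribution. The head indices and shapes below are made up for illustration; the paper identifies the spurious heads via their text labels.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, H, D = 200, 4, 8, 64  # images, layers, heads, embed dim (illustrative)

# Per-image, per-head contributions to each image's representation.
head_contrib = rng.normal(size=(N, L, H, D))

# Hypothetical (layer, head) pairs flagged as encoding a spurious cue,
# e.g. background/location on Waterbirds.
spurious = [(3, 1), (3, 5)]

# Replace each flagged head's output with its dataset mean: this removes
# its image-specific signal while leaving all other heads untouched.
ablated = head_contrib.copy()
for (l, h) in spurious:
    ablated[:, l, h] = head_contrib[:, l, h].mean(axis=0)
repaired_reps = ablated.sum(axis=(1, 2))
```

Because the decomposition is additive, ablating a head only deletes that head's term; the rest of the representation is unchanged, which is what makes the targeted repair possible.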
Next, the authors exploit the spatial structure provided by attention layers, decomposing the output across image locations. This yields a zero-shot image segmenter that outperforms existing CLIP-based zero-shot methods.
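The segmentation idea reduces to scoring each spatial location's contribution against a class text embedding. A minimal sketch with random stand-ins for the real embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
P, D = 49, 64  # 7x7 patch grid and embed dim (illustrative sizes)

patch_contrib = rng.normal(size=(P, D))  # per-patch contributions to the rep
text_emb = rng.normal(size=(D,))         # embedding of e.g. "a photo of a dog"
text_emb /= np.linalg.norm(text_emb)

# Similarity of each location's contribution to the class text gives a
# relevance map; thresholding it yields a binary zero-shot segmentation.
heatmap = (patch_contrib @ text_emb).reshape(7, 7)
mask = heatmap > heatmap.mean()
```

No segmentation labels are used anywhere; the spatial structure comes entirely from the per-patch decomposition and the text encoder.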
Finally, the authors consider the spatial structure jointly with the text basis, visualizing which image regions affect each text-labeled basis direction. This validates the text labels, showing that the regions with relevant properties (e.g. triangles) are the primary contributors to the corresponding basis direction.
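Combining both views amounts to producing one relevance map per text-labeled basis direction. Again a sketch with synthetic data; the basis size and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
P, D, K = 49, 64, 4  # patches (7x7 grid), embed dim, text-basis size

patch_contrib = rng.normal(size=(P, D))  # per-patch contributions for one head
text_basis = rng.normal(size=(K, D))     # its text-labeled basis vectors
text_basis /= np.linalg.norm(text_basis, axis=1, keepdims=True)

# One map per basis direction: which image regions drive each
# text-labeled component of the head's output.
maps = (patch_contrib @ text_basis.T).T.reshape(K, 7, 7)
```

If a direction labeled e.g. "triangles" lights up exactly on triangular regions, that is evidence the text label faithfully describes the direction.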