Interpreting CLIP's Image Representation via Text-Based Decomposition
We decompose CLIP's image representation into text-interpretable components attributed to individual attention heads and image locations, revealing specialized roles for many heads and emergent spatial localization.
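To make the idea of attributing the representation to heads and image locations concrete, here is a minimal sketch on synthetic data. It assumes, beyond what the summary states, that the image representation can be written as a sum of direct contributions indexed by layer, head, and image token, and that "text-interpretable" means scoring each component by its dot product with a text embedding direction. The `contribs` layout, the `decompose` helper, and the random `text_dir` are hypothetical and are not taken from CLIP or any library.

```python
import random

def decompose(contribs):
    """Sum direct contributions per head and per image token.

    contribs[l][h][i] is a d-dimensional vector: the assumed direct
    contribution of head h in layer l at image token i (hypothetical layout).
    """
    d = len(contribs[0][0][0])
    per_head = {}    # (layer, head) -> vector summed over tokens
    per_token = {}   # token index -> vector summed over layers and heads
    total = [0.0] * d
    for l, layer in enumerate(contribs):
        for h, head in enumerate(layer):
            for i, vec in enumerate(head):
                ph = per_head.setdefault((l, h), [0.0] * d)
                pt = per_token.setdefault(i, [0.0] * d)
                for k, v in enumerate(vec):
                    ph[k] += v
                    pt[k] += v
                    total[k] += v
    return total, per_head, per_token

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Synthetic stand-ins (assumptions): 2 layers, 3 heads, 4 image tokens, 5 dims.
random.seed(0)
L, H, N, D = 2, 3, 4, 5
contribs = [[[[random.gauss(0, 1) for _ in range(D)]
              for _ in range(N)]
             for _ in range(H)]
            for _ in range(L)]
text_dir = [random.gauss(0, 1) for _ in range(D)]  # stand-in for a text embedding

total, per_head, per_token = decompose(contribs)

# Because the dot product is linear, the image-text score splits exactly
# into per-head scores, and equally into per-location scores.
head_scores = {key: dot(vec, text_dir) for key, vec in per_head.items()}
token_scores = {i: dot(vec, text_dir) for i, vec in per_token.items()}
```

Under these assumptions, ranking `head_scores` suggests which heads respond most to a given text direction, and laying `token_scores` out on the patch grid gives a rough spatial heatmap. Any terms that cannot be attributed to a head or location are left out of this sketch.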