Attribute Recognition Through Vision-Based Prefix Language Modeling
Recognizing visual attributes and disentangling them from objects is foundational to many computer vision applications. The proposed approach models the image-object-attribute conditional probabilities using a large vision-language model trained with prefix language modeling, and applies a novel generative retrieval method to distill this knowledge for downstream attribute recognition tasks.
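
One possible reading of generative retrieval is that each candidate attribute is placed into a short sentence with the object, the sentence is scored by its log-likelihood under the language model, and the attributes are ranked by that score. The sketch below illustrates this idea only. The template "{attribute} {object}", the function names, and the toy probability table are all assumptions made for illustration, and the toy model ignores the image, which a real vision-language model would condition on.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a prefix language model: given the tokens
# generated so far, return a probability distribution over the next token.
# A real vision-language model would also condition on the image.
NextTokenFn = Callable[[Sequence[str]], Dict[str, float]]


def sentence_log_prob(tokens: Sequence[str], next_token_probs: NextTokenFn) -> float:
    """Sum of log p(token_i | tokens_<i), i.e. the autoregressive log-likelihood."""
    total = 0.0
    for i, token in enumerate(tokens):
        probs = next_token_probs(tokens[:i])
        # A small floor avoids log(0) for tokens the toy model does not know.
        total += math.log(probs.get(token, 1e-12))
    return total


def rank_attributes(
    obj: str,
    attributes: Sequence[str],
    next_token_probs: NextTokenFn,
    template: str = "{attribute} {object}",  # assumed template, not from the source
) -> List[Tuple[str, float]]:
    """Score one sentence per candidate attribute and sort by log-likelihood."""
    scored = []
    for attribute in attributes:
        tokens = template.format(attribute=attribute, object=obj).split()
        scored.append((attribute, sentence_log_prob(tokens, next_token_probs)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_model(prefix: Sequence[str]) -> Dict[str, float]:
    """Toy next-token table standing in for a trained model (made-up numbers)."""
    table = {
        (): {"red": 0.6, "blue": 0.3, "car": 0.1},
        ("red",): {"car": 0.9, "sky": 0.1},
        ("blue",): {"car": 0.2, "sky": 0.8},
    }
    return table.get(tuple(prefix), {})


ranking = rank_attributes("car", ["blue", "red"], toy_model)
print(ranking)
```

Because the score is a product of conditional probabilities, changing the word order in the template changes which dependency is modeled: placing the attribute before the object scores the object given the attribute, while the reverse order scores the attribute given the object.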