Evaluating Vision-Language Models' Understanding of Compound Nouns
Open-vocabulary vision-language models like CLIP struggle to interpret compound nouns as effectively as they understand individual nouns, particularly when one noun acts as an attribute to the other.