Zero-shot Referring Expression Comprehension via Structural Similarity Between Image Entities and Textual Descriptions
We leverage the structural similarity between visual and textual triplets, each comprising a subject, a predicate, and an object, to accurately link textual descriptions to their corresponding image regions in a zero-shot setting.
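As a rough illustration of triplet-level matching, the sketch below scores a textual triplet against candidate visual triplets and returns the image region of the best match. It is a minimal sketch under stated assumptions, not the method described in this work: the `Triplet` class, the `region_id` field, the equal-weight averaging of per-slot cosine similarities, and the toy two-dimensional embeddings are all hypothetical choices made for illustration. In practice the embeddings would come from a pretrained vision-language model.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = Sequence[float]


@dataclass
class Triplet:
    """Hypothetical container for (subject, predicate, object) embeddings."""
    subject: Vector
    predicate: Vector
    obj: Vector
    region_id: int = -1  # image region of the subject; unused for text triplets


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def triplet_similarity(text: Triplet, visual: Triplet) -> float:
    """Assumed scoring rule: unweighted mean of slot-wise cosine similarities."""
    return (
        cosine(text.subject, visual.subject)
        + cosine(text.predicate, visual.predicate)
        + cosine(text.obj, visual.obj)
    ) / 3.0


def ground(text: Triplet, candidates: List[Triplet]) -> int:
    """Return the region_id of the visual triplet most similar to the text triplet."""
    if not candidates:
        raise ValueError("no candidate visual triplets")
    best = max(candidates, key=lambda v: triplet_similarity(text, v))
    return best.region_id


# Toy usage with made-up 2-D embeddings.
text_triplet = Triplet([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
visual_triplets = [
    Triplet([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], region_id=3),
    Triplet([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], region_id=7),
]
selected_region = ground(text_triplet, visual_triplets)
```

Averaging the three slots equally is only one possible design. A weighted combination, or a score that also includes a direct subject-to-region similarity, would fit the same interface.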