Core Concepts
Visual scene context can make referring expression generation models more resilient to perturbations in target object representations, enabling them to identify referent types even when target information is completely missing.
Abstract
This paper investigates the role of visual scene context in improving the resilience of referring expression generation (REG) models to imperfect target representations. The authors train and test Transformer-based REG models with varying degrees of noise added to the target object representations, and evaluate their performance using automatic metrics as well as a focused human evaluation on the validity of assigned referent types.
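To make the idea of "varying degrees of noise" concrete, here is a minimal sketch of one possible perturbation scheme. It assumes the target is a flat list of feature values and that a given fraction of them is overwritten with Gaussian noise. The function name `perturb_target`, the `noise_level` parameter, and the unit-variance noise are illustrative assumptions, not the authors' actual procedure.

```python
import random


def perturb_target(features, noise_level, seed=None):
    """Overwrite a random fraction of the target features with Gaussian noise.

    Hypothetical helper: noise_level=0.0 leaves the target intact,
    noise_level=1.0 replaces every value (target fully occluded).
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    rng = random.Random(seed)
    # Number of positions to overwrite, rounded to the nearest integer.
    n_noisy = round(noise_level * len(features))
    noisy_idx = set(rng.sample(range(len(features)), n_noisy))
    # Assumption: standard normal noise; the original values are discarded.
    return [
        rng.gauss(0.0, 1.0) if i in noisy_idx else value
        for i, value in enumerate(features)
    ]
```

With `noise_level=1.0` no target information survives, which corresponds to the fully occluded condition discussed in the next paragraph. In that setting a model can only rely on whatever context input it is given.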
The results show that even simple scene context information can make REG models surprisingly resilient to target perturbations. When target representations are entirely occluded by noise, models with access to visual context can still identify referent types with high accuracy, outperforming models that only have access to the target information. This effect is more pronounced for non-human object classes, where context seems to provide stronger associations.
Further analysis reveals that the REG models learn to exploit the co-occurrence of similar objects in the visual context to compensate for missing target information. The models allocate more attention to context objects of the same class as the referent, especially in the decoder. This suggests that REG models can leverage regular patterns of object co-occurrence in scenes to generate adequate descriptions, even when the target itself is not clearly visible.
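The attention analysis described above can be illustrated with a small sketch that measures how much attention mass falls on context objects sharing the referent's class. It assumes one attention weight per context object and a known class label for each. The function name `same_class_attention_share` and the input layout are hypothetical, and the paper's actual analysis may aggregate attention differently.

```python
def same_class_attention_share(attn_weights, context_classes, target_class):
    """Return the fraction of attention mass on context objects whose
    class matches the referent's class.

    Hypothetical helper: attn_weights[i] is the (non-negative) attention
    assigned to the i-th context object, context_classes[i] its label.
    """
    if len(attn_weights) != len(context_classes):
        raise ValueError("need exactly one class label per attention weight")
    total = sum(attn_weights)
    if total == 0:
        return 0.0  # no attention on context at all
    same = sum(
        weight
        for weight, cls in zip(attn_weights, context_classes)
        if cls == target_class
    )
    return same / total


# Usage: two of three context objects share the referent's class.
share = same_class_attention_share(
    [0.5, 0.3, 0.2], ["couch", "chair", "couch"], "couch"
)
```

A share well above the proportion of same-class objects in the scene would be consistent with the finding that models attend preferentially to objects of the referent's class.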
The authors also run experiments on a more diverse and less standardized dataset (PACO-EGO4D). These show that the benefit of visual context is not universal but depends on the characteristics of the data. Overall, the findings offer new perspectives on the role of scene context in visual REG, going beyond the traditional view of context as a source of distraction.
Stats
The target object is a couch on the right side of the image.
The visual context contains a brown chair on the right side of the image.
Quotes
"Scene context is well known to facilitate humans' perception of visible objects."
"We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular."