Ha, S., Kim, C., Kim, D., Lee, J., Lee, S., & Lee, J. (2024). Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation. arXiv preprint arXiv:2411.01494.
This paper introduces Negative-mined Mosaic Augmentation (NeMo), a data augmentation method aimed at the limited performance of Referring Image Segmentation (RIS) models, particularly in scenarios with high visual ambiguity and complex linguistic expressions.
NeMo augments each training image by combining it with three carefully selected negative images in a 2x2 mosaic grid. The negative images are chosen by their relevance to the referring expression, as scored by a pre-trained cross-modal model such as CLIP. The method has two key hyperparameters: τ, a threshold for filtering out images that are excessively similar to the expression, and K, the number of top-ranked negative image candidates considered.
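The selection and tiling steps described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `select_negatives` and `make_mosaic` are hypothetical, the text-image similarity scores are assumed to be precomputed by a CLIP-like model, the comparison against τ is assumed to be strict, the three negatives are assumed to be drawn uniformly at random from the top K candidates, and all images are assumed to be already resized to a common shape.

```python
import numpy as np


def select_negatives(sim, tau, k, n_neg=3, rng=None):
    """Pick n_neg negative image indices for one referring expression.

    sim[i] is the (assumed precomputed) similarity between the expression
    and candidate image i, e.g. from a CLIP-like cross-modal model.
    """
    rng = np.random.default_rng() if rng is None else rng
    sim = np.asarray(sim, dtype=float)
    # Filter out candidates that are excessively similar to the expression
    # (assumption: strict comparison against the threshold tau).
    allowed = np.flatnonzero(sim < tau)
    # Rank the remaining candidates by similarity and keep the top K.
    top_k = allowed[np.argsort(-sim[allowed])][:k]
    if top_k.size < n_neg:
        raise ValueError("fewer than n_neg candidates survive tau and K")
    # Assumption: sample the negatives uniformly from the top-K pool.
    return rng.choice(top_k, size=n_neg, replace=False)


def make_mosaic(target_img, target_mask, negatives, slot=0):
    """Tile the target and three negatives into a 2x2 grid.

    target_img and each negative are H x W x C arrays of identical shape;
    target_mask is H x W. slot (0..3, row-major) is the target's quadrant.
    """
    h, w = target_mask.shape
    tiles = list(negatives)
    tiles.insert(slot, target_img)
    top = np.concatenate(tiles[:2], axis=1)
    bottom = np.concatenate(tiles[2:], axis=1)
    mosaic = np.concatenate([top, bottom], axis=0)
    # Only the target's quadrant carries a segmentation mask.
    mask = np.zeros((2 * h, 2 * w), dtype=target_mask.dtype)
    row, col = divmod(slot, 2)
    mask[row * h:(row + 1) * h, col * w:(col + 1) * w] = target_mask
    return mosaic, mask
```

In this sketch, a lower τ discards more near-duplicate candidates and a smaller K restricts sampling to the most relevant remaining negatives, which is one way to read the trade-off between the two hyperparameters.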
NeMo consistently improves the performance of five state-of-the-art RIS models across four benchmark datasets (RefCOCO, RefCOCO+, G-Ref, and GRES). The improvement is particularly significant on datasets with higher visual-linguistic complexity, such as G-Ref and GRES. The ablation study demonstrates the importance of carefully tuning the hyperparameters τ and K to achieve optimal performance.
The authors conclude that NeMo effectively addresses the data bottleneck in RIS by providing models with more challenging and realistic training examples. This leads to improved visual and linguistic understanding, enabling the models to better handle complex referring expressions and ambiguous scenes.
This research contributes to the field of Referring Image Segmentation a simple yet effective data augmentation technique that can be easily integrated into existing RIS pipelines. The findings highlight the importance of data quality and complexity in training robust and accurate RIS models.
The authors acknowledge that NeMo's performance might be limited when applied to datasets with highly diverse image domains. Future research could explore more sophisticated methodologies for negative image selection, such as object-level parsing, to further enhance the augmentation process.
Key insights distilled from: Seongsu Ha, ... at arxiv.org, 11-05-2024
https://arxiv.org/pdf/2411.01494.pdf