Core Concepts
Scaling the number of positive and negative examples in contrastive learning can effectively improve the performance of composed image retrieval models.
Summary
The paper proposes a method to efficiently scale the number of positive and negative examples in the Composed Image Retrieval (CIR) task, which aims to retrieve target images using a composed query consisting of a reference image and a modified text.
To address the lack of positive examples in CIR datasets, the authors propose a data generation method that leverages a multi-modal large language model to automatically construct triplets (reference image, modified text, target image) for CIR. This method can scale the number of positive examples from around 20k to 100k without using external datasets.
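The paper's exact prompting pipeline is not detailed here, so the following is only a minimal sketch of how such MLLM-driven triplet construction could look: image pairs drawn from within the existing dataset (e.g., visually similar images) are passed to a multi-modal LLM, which writes the modified text linking them. The `mllm_generate` callable and the pairing strategy are assumptions for illustration, not the paper's API.

```python
# Hypothetical sketch of MLLM-driven triplet construction for CIR.
# `mllm_generate` stands in for any multi-modal LLM call; it is NOT
# the paper's interface. Image pairs are assumed to come from within
# the dataset itself (no external data), e.g., embedding-space neighbors.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Triplet:
    reference_image: str   # path or ID of the reference image
    modified_text: str     # text describing the edit from reference to target
    target_image: str      # path or ID of the target image

def build_triplets(
    image_pairs: Sequence[Tuple[str, str]],
    mllm_generate: Callable[[str, str], str],
) -> List[Triplet]:
    """For each (reference, target) pair, ask the MLLM to describe the
    difference as an edit instruction, yielding a new CIR triplet."""
    triplets = []
    for ref, tgt in image_pairs:
        prompt = "Describe how to modify the first image to obtain the second."
        text = mllm_generate(prompt, f"{ref}|{tgt}")  # hypothetical call
        triplets.append(Triplet(ref, text, tgt))
    return triplets
```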
To introduce more negative examples during fine-tuning, the authors design a two-stage fine-tuning framework. In the first stage, the model is fine-tuned with in-batch negative sampling, as in previous work. In the second stage, the target image encoder is frozen and only the query encoder is fine-tuned. Because the frozen encoder's outputs no longer change, the representations of the entire candidate set can be precomputed once and reused as a large pool of static negatives, guiding the query encoder to rapidly optimize the representation space.
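A minimal PyTorch sketch of that second stage, under stated assumptions: the candidate set is encoded once by the frozen target encoder, and the query encoder is trained with a cross-entropy loss in which the target image is the positive and every other candidate representation serves as a static negative. The encoder interfaces, function names, and temperature value are illustrative, not the paper's implementation.

```python
# Sketch of stage-two fine-tuning: the target image encoder is frozen,
# its embeddings for the whole candidate set are precomputed once, and
# only the query encoder is updated against this static negative bank.
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_candidate_bank(target_encoder, all_candidate_images):
    """Encode every candidate image once with the frozen target encoder."""
    target_encoder.eval()
    feats = target_encoder(all_candidate_images)   # (N, d) static features
    return F.normalize(feats, dim=-1)

def stage2_step(query_encoder, bank, ref_images, texts, target_ids,
                optimizer, temperature=0.07):
    """One update of the query encoder against the full static bank.

    target_ids: index of each query's target image within the bank, so
    all N-1 other candidates act as negatives in the same loss.
    """
    q = F.normalize(query_encoder(ref_images, texts), dim=-1)  # (B, d)
    logits = q @ bank.t() / temperature                        # (B, N)
    loss = F.cross_entropy(logits, target_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since the bank never changes during this stage, it is computed once before training rather than re-encoded per batch, which is what makes scaling to the entire candidate set cheap.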
The authors extensively evaluate their method on two popular CIR datasets, FashionIQ and CIRR, and demonstrate that effectively scaling positive and negative examples leads to state-of-the-art performance. They also show that their method extends to the zero-shot CIR setting, in which no human-labeled triplets are available for training, by automatically constructing positive and negative examples from image datasets.
Statistics
The average token length of the modified text is 16.5 for FashionIQ and 20.9 for CIRR.
The number of triplets (reference image, modified text, target image) is scaled from 18k to 96k for FashionIQ and from 28k to 128k for CIRR.
Quotes
"To address the problem of lack of positives, we propose a data generation method by leveraging a multi-modal large language model to construct triplets for CIR."
"To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly."