The paper investigates the problem of paraphrased text-image retrieval, where a model should return similar visual search results for a pair of paraphrased language queries. The authors first collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation of this task.
They hypothesize that the undesired behavior of existing dual-encoder models like CLIP, which often return very different top retrievals for paraphrased queries, is due to their text towers being trained on limited image-sentence pairs. To address this, the authors explore multiple strategies for training a dual-encoder model starting from a large pretrained language model.
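The core training setup the summary refers to is CLIP-style contrastive learning between an image tower and a text tower (which the paper initializes from a pretrained language model). As a minimal illustration of that objective, here is a numpy sketch of the symmetric InfoNCE loss over a batch of paired embeddings, plus a nearest-neighbor retrieval helper; the function names and the temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as in CLIP."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs sit on the diagonal."""
    img = l2_normalize(np.asarray(img_emb, dtype=float))
    txt = l2_normalize(np.asarray(txt_emb, dtype=float))
    logits = img @ txt.T / temperature       # (B, B) cosine similarities
    labels = np.arange(len(logits))

    def xent(lg):
        # numerically stable cross-entropy against the diagonal labels
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def retrieve(query_emb, image_embs, k=3):
    """Return indices of the top-k images by cosine similarity to a query."""
    sims = l2_normalize(np.asarray(image_embs, dtype=float)) @ l2_normalize(
        np.asarray(query_emb, dtype=float)
    )
    return np.argsort(-sims)[:k]
```

Because embeddings are L2-normalized before scoring, two paraphrased queries return identical retrievals exactly when their embeddings point in the same direction; the paper's goal is to make the text tower map paraphrases to such nearby directions.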
The authors demonstrate the effectiveness of their approach on both small-scale (COCO) and large-scale (LAION-400M) training data, highlighting the benefits of leveraging pretrained language models for improving the paraphrase consistency of dual-encoder vision-language models.
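To evaluate the task quantitatively, one needs a measure of how similar two retrieval results are for a paraphrase pair. A simple, commonly used choice is the Jaccard overlap of the top-k retrieved sets; this is an illustrative metric sketch under that assumption, not necessarily the exact metric the paper reports.

```python
def topk_jaccard(ranking_a, ranking_b, k=10):
    """Jaccard overlap of two top-k retrieval lists.

    Returns 1.0 when both queries retrieve the same k images
    (order ignored) and 0.0 when the sets are disjoint.
    """
    a, b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(a & b) / len(a | b)
```

Averaging this score over a dataset of paraphrase pairs, such as the one the authors collect, gives a single consistency number for comparing models.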
by Jiacheng Che... at arxiv.org, 05-07-2024
https://arxiv.org/pdf/2405.03190.pdf