This paper presents a novel approach to automatically matching images to text, specifically matching application images to the search phrases that users might use to discover those applications. The authors introduce a new fine-tuning method for a pre-trained cross-modal model, adapting it to search-phrase and application-image data.
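As an illustration of what such a fine-tuning setup might look like, the sketch below fine-tunes a CLIP-style dual encoder on batches of (app image, search phrase) pairs using its built-in contrastive loss. The base model, loss, and training details are assumptions made for illustration only; the summary does not specify the authors' exact architecture or training recipe.

```python
# Minimal sketch: contrastive fine-tuning of a pre-trained cross-modal model
# on (app image, search phrase) pairs. The CLIP checkpoint, optimizer, and
# hyperparameters below are illustrative assumptions, not the paper's setup.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(images, search_phrases):
    """Run one fine-tuning step on a batch of app images and search phrases."""
    inputs = processor(text=search_phrases, images=images,
                       return_tensors="pt", padding=True)
    # return_loss=True makes CLIP compute its symmetric contrastive loss,
    # treating matched (image, phrase) pairs as in-batch positives.
    outputs = model(**inputs, return_loss=True)
    loss = outputs.loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```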
The key highlights and insights are:
The authors evaluate their matching approach in two ways: 1) using the application developers' intuitions about which search phrases are most relevant to their application images, and 2) using the intuitions of professional human annotators about which search phrases are most relevant to a given application.
The authors' approach achieves state-of-the-art performance, outperforming current models by 8-17% in terms of AUC (Area Under the Curve) on the two evaluation datasets.
The authors suggest that the performance lift is likely due to their use of a cross-modal encoder, which captures the relationship between text and images more effectively than the fusion approaches used by the baseline models.
The authors conclude that their work opens the door to automatic image selection and self-serve capabilities for app developers, allowing them to easily identify the best images to promote their applications (a sketch of such a scoring and evaluation workflow follows below).
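The sketch below shows how a fine-tuned cross-modal encoder of this kind could rank candidate search phrases for an app image and be evaluated with AUC against human relevance labels. The helper functions, the use of scikit-learn's roc_auc_score, and the example data are illustrative assumptions rather than the authors' actual pipeline; the model and processor are the ones from the fine-tuning sketch above.

```python
# Minimal sketch: scoring search phrases against an app image with a
# fine-tuned cross-modal encoder, then computing AUC over binary relevance
# labels (e.g., from developers or professional annotators).
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def score_phrases(model, processor, image, phrases):
    """Return a relevance score for each search phrase against one app image."""
    inputs = processor(text=phrases, images=image,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image has shape (1, num_phrases); higher means more relevant.
    return outputs.logits_per_image.squeeze(0).tolist()

def evaluate_auc(scores, labels):
    """AUC over binary labels (1 = the phrase was judged relevant)."""
    return roc_auc_score(labels, scores)

# Hypothetical usage:
# scores = score_phrases(model, processor, app_image,
#                        ["photo editor", "fitness tracker"])
# print(evaluate_auc(scores, labels=[1, 0]))
```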
Key insights distilled from: Alex Kim, Jia... at arxiv.org, 05-02-2024. https://arxiv.org/pdf/2405.00029.pdf