The paper presents HaVTR, a novel video-text learning paradigm that augments video and text data to learn more generalizable features. Specifically, it introduces three augmentation methods:
Simple Augmentation (SA): Randomly duplicating or dropping subwords and frames to generate self-similar data (sketched in code after this list).
Augmentation by Text Paraphrasing and Video Stylization (ATPVS): Using large language models (LLMs) and visual generative models (VGMs) to generate semantically similar videos and texts through paraphrasing and stylization.
Augmentation by Hallucination (AH): Using LLMs and VGMs to generate and add new, relevant information to the original video and text data (the second sketch below illustrates the text side of the LLM-based augmentations).
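A minimal sketch of how Simple Augmentation could be implemented, assuming the caption is already tokenized into subwords and the video is represented as a list of frames; the function name and the drop/duplicate probabilities are illustrative choices, not values taken from the paper.

```python
import random

def simple_augment(tokens, frames, p_drop=0.1, p_dup=0.1, seed=None):
    """Randomly drop or duplicate subword tokens and video frames.

    tokens: list of subword tokens (or IDs) for the caption.
    frames: list of video frames (e.g. arrays or file paths).
    p_drop / p_dup: per-item probabilities (illustrative values).
    Returns an augmented (tokens, frames) pair that stays
    self-similar to the original sample.
    """
    rng = random.Random(seed)

    def jitter(seq):
        out = []
        for item in seq:
            r = rng.random()
            if r < p_drop:
                continue               # drop this subword/frame
            out.append(item)
            if r > 1.0 - p_dup:
                out.append(item)       # duplicate this subword/frame
        return out or list(seq)        # never return an empty sequence

    return jitter(tokens), jitter(frames)

# Example: augment a toy caption and a 6-frame clip.
aug_tokens, aug_frames = simple_augment(
    ["a", "man", "is", "play", "##ing", "guitar"],
    [f"frame_{i}" for i in range(6)],
    seed=0,
)
```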
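For the LLM-based text augmentations (paraphrasing in ATPVS and hallucination in AH), a plausible implementation builds a prompt and calls a text-generation model. The `generate` callable below is a hypothetical stand-in for whichever LLM API is used, and the prompt wording is an assumption, not quoted from the paper; the video side (stylization via a VGM) would follow the same pattern and is not shown.

```python
def paraphrase_prompt(caption: str) -> str:
    # ATPVS: ask the LLM for a semantically equivalent rewording.
    return f'Paraphrase the following video caption, keeping its meaning: "{caption}"'

def hallucination_prompt(caption: str) -> str:
    # AH: ask the LLM to add plausible, relevant new detail to the caption.
    return (f"Expand the following video caption with additional plausible "
            f'details about the scene: "{caption}"')

def augment_caption(caption: str, generate, mode: str = "paraphrase") -> str:
    """Return an augmented caption via a text LLM.

    `generate` is a hypothetical callable (prompt -> generated text) wrapping
    whatever LLM is available; it is not an API defined by the paper.
    """
    prompt = (paraphrase_prompt(caption) if mode == "paraphrase"
              else hallucination_prompt(caption))
    return generate(prompt)

# Toy usage with a stand-in "LLM" that simply echoes the prompt.
print(augment_caption("a man is playing guitar",
                      generate=lambda p: p, mode="hallucinate"))
```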
The authors show that these augmentation methods, especially AH, can significantly improve video-text retrieval performance on several benchmarks, including MSR-VTT, MSVD, and ActivityNet. The results demonstrate the effectiveness of leveraging large foundation models to enrich the training data and enhance the representation learning ability of video-text retrieval models.
Key insights extracted from the source content by Yimu Wang, Sh... at arxiv.org, 04-09-2024.
https://arxiv.org/pdf/2404.05083.pdf