The paper presents a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalizable representations. Specifically, it introduces three augmentation methods:
Simple Augmentation (SA): Randomly duplicating or dropping subwords and frames to generate self-similar data (see the first sketch after this list).
Augmentation by Text Paraphrasing and Video Stylization (ATPVS): Using large language models (LLMs) and visual generative models (VGMs) to generate semantically similar videos and texts through paraphrasing and stylization.
Augmentation by Hallucination (AH): Using LLMs and VGMs to generate and add new relevant information to the original video and text data (the text sides of ATPVS and AH are sketched together after the list).
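As a rough illustration of SA, the duplicate/drop idea can be applied symmetrically to subword tokens and to video frames. The helper names and the probabilities `p_dup` and `p_drop` below are illustrative assumptions, not the paper's exact procedure:

```python
import random

def augment_sequence(items, p_dup=0.1, p_drop=0.1):
    """Randomly drop or duplicate elements of a sequence (subword tokens or frames)."""
    out = []
    for item in items:
        r = random.random()
        if r < p_drop:
            continue            # drop this element
        out.append(item)
        if r > 1.0 - p_dup:
            out.append(item)    # duplicate this element
    return out if out else list(items)  # never return an empty caption/clip

# Example: one caption and a dummy 8-frame clip
caption = "a man is playing guitar on stage".split()
frames = list(range(8))
print(augment_sequence(caption))
print(augment_sequence(frames))
```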
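The text sides of ATPVS and AH can both be sketched as single LLM prompts: one asks for a meaning-preserving rewrite, the other asks the model to enrich the caption with plausible related details. The prompt wording and the `llm` callable are placeholder assumptions rather than the paper's exact setup; the video sides would instead route frames through a visual generative model.

```python
def paraphrase_caption(caption, llm):
    """ATPVS (text side): ask an LLM for a semantically equivalent rewrite."""
    prompt = (
        "Paraphrase the following video caption. Keep the meaning the same, "
        f"change only the wording.\nCaption: {caption}\nParaphrase:"
    )
    return llm(prompt).strip()

def hallucinate_caption(caption, llm):
    """AH (text side): ask an LLM to add plausible, relevant new details."""
    prompt = (
        "Rewrite the following video caption, adding plausible details that are "
        f"consistent with it.\nCaption: {caption}\nEnriched caption:"
    )
    return llm(prompt).strip()

# `llm` can be any callable mapping a prompt string to a completion string,
# e.g. a thin wrapper around whatever text-generation backend is available:
# richer = hallucinate_caption("a man is playing guitar on stage", my_llm)
```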
The authors show that these augmentation methods, especially AH, can significantly improve video-text retrieval performance on several benchmarks, including MSR-VTT, MSVD, and ActivityNet. The results demonstrate the effectiveness of leveraging large foundation models to enrich the training data and enhance the representation learning ability of video-text retrieval models.
Key ideas extracted from https://arxiv.org/pdf/2404.05083.pdf by Yimu Wang et al., arxiv.org, 04-09-2024.