Enhancing Video-Text Retrieval through Augmentation using Large Foundation Models
Augmenting video and text data with large language models and visual generative models can significantly improve both the representation learning and the retrieval performance of video-text retrieval models.