Enhancing Text-to-Video Retrieval by Leveraging Automatic Image Captioning
Captions generated by automatic image captioning models can serve as a supervision signal for training text-to-video retrieval models on unlabeled video data, and the resulting models outperform strong zero-shot baselines.
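
One way to picture this kind of supervision is a contrastive objective in which each unlabeled video is paired with a caption produced automatically for it, and the other captions in the batch act as negatives. The sketch below is only an illustration under stated assumptions: the summary above does not specify the loss, the captioner, or the encoders, so the symmetric contrastive loss, the `temperature` value, and all function names here are hypothetical choices, and the embeddings are toy vectors standing in for encoder outputs.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, caption_embs, temperature=0.05):
    """Hypothetical training loss: video i should match its generated caption i.

    video_embs / caption_embs are equal-length lists of embedding vectors.
    The temperature is an assumed value, not taken from the source.
    """
    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in caption_embs]
    # Cosine similarity between every video and every caption, scaled.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy embeddings standing in for encoder outputs (assumed, not real data).
videos = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[0.9, 0.1], [0.1, 0.9]]
captions_swapped = captions_matched[::-1]

loss_matched = symmetric_contrastive_loss(videos, captions_matched)
loss_swapped = symmetric_contrastive_loss(videos, captions_swapped)
```

In this toy setup the loss is lower when each video is paired with its own generated caption than when the captions are swapped, which is the signal a retrieval model would be trained to exploit.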