Core Concepts
The paper's core contribution is improving the performance of interpretable embeddings for ad-hoc video search by constructing a large-scale video-text dataset with generated captions and developing a multi-word concept bank.
Abstract
The paper addresses two main problems in ad-hoc video search (AVS): the small size of available video-text datasets and the low quality of concept banks, which cause failures on unseen queries and give rise to the out-of-vocabulary problem.
To tackle these issues, the paper proposes three key components:
Constructing a new large-scale video-text dataset, WebVid-genCap7M, containing 7 million generated text-video pairs. This dataset is used to pre-train the interpretable embedding model (a caption-generation sketch follows this list).
Developing a multi-word concept bank based on syntax analysis to strengthen the interpretable embedding model's ability to model relationships between query words. Besides individual words, the concept bank includes common phrases such as noun phrases, verb phrases, and prepositional phrases (a phrase-extraction sketch follows this list).
Integrating recent advanced text and visual features, such as CLIP, BLIP-2, and ImageBind, into the interpretable embedding model (a feature-extraction sketch follows this list).
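The paper pairs videos with generated captions, but this summary does not pin down the exact captioning model, so the following is a minimal sketch of the caption-generation step, assuming BLIP as a stand-in captioner applied to one extracted keyframe per video; the checkpoint name, frame-sampling choice, and helper name are assumptions.

```python
# Sketch: generating a caption for a sampled video keyframe with BLIP.
# BLIP is a stand-in here; the paper's actual captioning model may differ.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_keyframe(frame_path: str) -> str:
    """Caption a single extracted keyframe (hypothetical helper)."""
    frame = Image.open(frame_path).convert("RGB")
    inputs = processor(images=frame, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Pairing each video with its generated caption yields text-video pairs
# analogous to WebVid-genCap7M's 7 million generated pairs.
print(caption_keyframe("video_0001_keyframe.jpg"))
```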
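To make the syntax-analysis idea concrete, here is a minimal sketch of multi-word concept extraction using spaCy's dependency parse. The paper's exact extraction and filtering rules are not given in this summary, so the chunking rules below (noun chunks, verb + direct-object pairs, preposition + object pairs) are illustrative assumptions.

```python
# Sketch: harvesting multi-word concept candidates (noun phrases,
# verb phrases, prepositional phrases) from training captions with spaCy.
# The paper's actual extraction rules may differ; these are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_phrase_concepts(caption: str) -> set[str]:
    doc = nlp(caption)
    concepts = set()
    # Noun phrases come straight from spaCy's noun chunks.
    for chunk in doc.noun_chunks:
        concepts.add(chunk.lemma_.lower())
    for token in doc:
        # Verb phrases: a verb plus its direct object, e.g. "ride bike".
        if token.dep_ == "dobj" and token.head.pos_ == "VERB":
            concepts.add(f"{token.head.lemma_} {token.lemma_}".lower())
        # Prepositional phrases: a preposition plus its object, e.g. "on beach".
        if token.dep_ == "pobj" and token.head.dep_ == "prep":
            concepts.add(f"{token.head.text} {token.lemma_}".lower())
    return concepts

print(extract_phrase_concepts("A man rides a bike on the beach"))
# e.g. {'a man', 'a bike', 'the beach', 'ride bike', 'on beach'}
```

Such phrases let the concept bank match a query like "riding a bike on the beach" as whole units rather than as the isolated words "ride", "bike", and "beach".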
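For the feature-integration step, below is a minimal sketch of extracting the kind of CLIP text and visual embeddings that would feed such a model. How the paper fuses these with BLIP-2 and ImageBind features is not reproduced; the checkpoint name, file paths, and the similarity computation are assumptions.

```python
# Sketch: extracting CLIP text/visual features for a query and a keyframe,
# then scoring them as in embedding-based search. Fusion with BLIP-2 and
# ImageBind features is omitted.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a person riding a bike on the beach"
frame = Image.open("video_0001_keyframe.jpg").convert("RGB")

inputs = processor(text=[query], images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

text_feat = outputs.text_embeds     # (1, 512) query embedding
visual_feat = outputs.image_embeds  # (1, 512) frame embedding

# Cosine similarity between query and frame, the basic relevance score
# in embedding-based search.
sim = torch.nn.functional.cosine_similarity(text_feat, visual_feat)
print(float(sim))
```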
The experimental results show that the integration of these three components significantly improves the performance of the interpretable embedding model on the TRECVid AVS datasets, outperforming most of the top-1 results reported over the past eight years. Specifically:
The multi-word concept bank boosts the concept-based search by around 60% on average compared to the word-only concept bank, making it competitive with the embedding-based search.
Pre-training on the WebVid-genCap7M dataset improves the embedding-based search but degrades the concept-based search due to the ground-truth problem.
Integrating the advanced text and visual features further enhances the overall performance, with the visual features contributing more than the textual features.
Overall, the proposed improvements to the interpretable embedding model establish a new state-of-the-art for the ad-hoc video search task.
Stats
The WebVid-genCap7M dataset contains 7 million generated text-video pairs for 1.44 million videos.
The multi-word concept bank has 14,528 concepts, with 9,465 being phrases.
62% of the phrases appear between 20 and 50 times, and 18% appear more than 100 times in the training corpus.
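As a rough illustration of how such frequency statistics could be computed when selecting phrase concepts, here is a sketch that counts phrase occurrences over a caption corpus; the captions variable, the reuse of the earlier extract_phrase_concepts helper, and the bucket thresholds are assumptions mirroring the numbers above.

```python
# Sketch: bucketing phrase-concept frequencies over a training caption
# corpus. `captions` and `extract_phrase_concepts` are hypothetical
# inputs/helpers (the latter from the earlier spaCy sketch).
from collections import Counter

def phrase_frequency_buckets(captions: list[str]) -> dict[str, float]:
    counts = Counter()
    for caption in captions:
        for phrase in extract_phrase_concepts(caption):
            if " " in phrase:  # keep multi-word phrases only
                counts[phrase] += 1
    total = max(len(counts), 1)  # guard against an empty corpus
    mid = sum(1 for c in counts.values() if 20 <= c <= 50)
    high = sum(1 for c in counts.values() if c > 100)
    return {"20-50 share": mid / total, ">100 share": high / total}
```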
Quotes
"The experimental results show that the integration of the above-proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016–2023 (eight years) by a margin from 2% to 77%, with an average about 20%."