Key Concepts
The authors propose Video Anomaly Retrieval (VAR), a new task of retrieving relevant anomalous videos from detailed descriptions, and introduce large-scale benchmarks together with a model called ALAN for VAR.
Summary
The article argues that Video Anomaly Retrieval (VAR) bridges the gap between the research literature and real-world applications. It introduces the benchmarks UCFCrime-AR and XDViolence-AR, along with the ALAN model for VAR. The method combines anomaly-led sampling, video prompt-based masked phrase modeling (VPMPM), and cross-modal alignment.
The article highlights the challenges VAR poses relative to traditional video retrieval, chiefly the long, untrimmed videos involved. It explains the structure of ALAN, including encoders for video, text, and audio, as well as the anomaly-led sampling mechanism, which biases frame selection toward anomalous segments. The VPMPM pretext task is introduced to learn fine-grained associations for video-text retrieval, and cross-modal alignment techniques match representations across the different modalities.
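The article does not reproduce ALAN's exact alignment objective, so the following is only a minimal sketch of the general cross-modal alignment idea it names: paired video and text embeddings are pulled together with a symmetric InfoNCE-style contrastive loss over cosine similarities. The function names, embedding shapes, and temperature value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning paired embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    This is a generic sketch of cross-modal alignment, not ALAN's actual loss.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature  # (batch, batch) cosine-similarity matrix
    # Cross-entropy with the diagonal (matched pairs) as targets, in both
    # retrieval directions (video-to-text and text-to-video).
    log_sm_v2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_t2v = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    n = len(logits)
    return -(np.trace(log_sm_v2t) + np.trace(log_sm_t2v)) / (2 * n)
```

Correctly matched pairs should yield a lower loss than mismatched ones, which is what drives the two modalities into a shared embedding space.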
Evaluation is reported as R@K values on UCFCrime-AR and XDViolence-AR. Ablation studies on anomaly-led sampling, VPMPM, and cross-modal alignment analyze each component's impact on performance, and the influence of hyperparameters such as α is also explored.
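R@K (recall at rank K) is the standard retrieval metric referenced above: the fraction of queries whose ground-truth video appears among the top-K retrieved results. A minimal sketch, assuming the usual benchmark convention that query i's true match is gallery item i:

```python
import numpy as np

def recall_at_k(similarity, k):
    """Compute R@K from a (queries x gallery) similarity matrix.

    Assumes the ground-truth match for query i is gallery item i,
    the standard pairing in retrieval benchmarks.
    """
    # Sort each row's gallery indices by descending similarity.
    ranks = np.argsort(-similarity, axis=1)
    # A hit if the true match (index i) appears in the top-k for query i.
    hits = [i in ranks[i, :k] for i in range(len(similarity))]
    return float(np.mean(hits))
```

Reporting R@1, R@5, and R@10 side by side, as retrieval papers typically do, shows how quickly the correct video surfaces as the allowed rank grows.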
The qualitative analysis includes visualizations of retrieval results on UCFCrime-AR and coarse-caption retrieval performance for captions of different lengths.
Statistics
Videos possess space-time information.
UCF-Crime dataset consists of 1900 untrimmed videos.
XD-Violence dataset contains 3954 long videos.
Average length of videos in UCFCrime-AR is 242s.
Average length of videos in XDViolence-AR is 164s.
Quotes
"In reality, users tend to search a specific video rather than a series of approximate videos."
"Our VAR is considerably different from traditional video retrieval."
"Such a setup more meets realistic requirements."