Core Concept
A novel training-free method for video anomaly detection that leverages pre-trained large language models and vision-language models to detect anomalies without any task-specific training or data collection.
Summary
The paper introduces LAVAD, a training-free method for video anomaly detection (VAD) that leverages pre-trained large language models (LLMs) and vision-language models (VLMs).
The key highlights are:
- LAVAD is the first training-free method for VAD, in contrast to existing state-of-the-art methods, all of which require some form of training and data collection.
- LAVAD consists of three main components:
  - Image-Text Caption Cleaning: uses cross-modal similarity between image and text embeddings to clean noisy captions generated by a captioning model.
  - LLM-based Anomaly Scoring: prompts the LLM to generate temporal summaries of the video frames and uses them to estimate anomaly scores.
  - Video-Text Score Refinement: further refines the anomaly scores by aggregating scores from semantically similar video snippets.
- Experiments on two benchmark datasets, UCF-Crime and XD-Violence, show that LAVAD outperforms both unsupervised and one-class VAD methods without requiring any training or data collection.
- The authors conduct an extensive ablation study validating the effectiveness of each proposed component and the impact of different design choices.
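The three components above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes precomputed image/text embeddings (e.g., from a CLIP-style encoder), and the LLM scoring step is a stub, since the actual method prompts an LLM with temporal summaries. All function names are hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def clean_captions(frame_embs, caption_embs, captions):
    """Stage 1 (caption cleaning): for each frame, keep the caption from the
    pool whose text embedding is most similar to the frame's image embedding."""
    sims = cosine(frame_embs, caption_embs)   # (n_frames, n_captions)
    best = sims.argmax(axis=1)
    return [captions[i] for i in best]

def llm_anomaly_scores(summaries):
    """Stage 2 (LLM-based scoring): in the paper, each temporal summary is
    sent to an LLM that rates how anomalous it is. Stubbed as zeros here."""
    return np.zeros(len(summaries))           # placeholder scores

def refine_scores(scores, snippet_embs, k=3):
    """Stage 3 (score refinement): replace each snippet's score with the mean
    score of its k most semantically similar snippets (including itself)."""
    sims = cosine(snippet_embs, snippet_embs)
    refined = np.empty_like(scores, dtype=float)
    for i in range(len(scores)):
        nearest = np.argsort(-sims[i])[:k]    # indices of k most similar
        refined[i] = scores[nearest].mean()
    return refined
```

The refinement stage acts as a semantic smoother: snippets that look alike should receive similar anomaly scores, which suppresses noisy per-frame LLM judgments.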
Overall, the paper presents a novel training-free approach to VAD that demonstrates the potential of leveraging large-scale foundation models for addressing challenging computer vision tasks without the need for task-specific training.
Statistics
"We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence)."
"The training set of UCF-Crime consists of 800 normal and 810 anomalous videos, while the test set includes 150 normal and 140 anomalous videos."
"XD-Violence is another large-scale dataset for violence detection, comprising 4754 untrimmed videos with audio signals and weak labels that are collected from both movies and YouTube."
Quotations
"Crucially, every existing method necessitates a training procedure to establish an accurate VAD system, and this entails some limitations. One primary concern is generalization: a VAD model trained on a specific dataset tends to underperform in videos recorded in different settings (e.g., daylight versus night scenes). Another aspect, particularly relevant to VAD, is the challenge of data collection, especially in certain application domains (e.g. video surveillance) where privacy issues can hinder data acquisition."
"Developing a training-free VAD model is hard due to the lack of explicit visual priors on the target setting. However, such priors might be drawn using large foundation models, renowned for their generalization capability and wide knowledge encapsulation."