The paper presents a novel self-supervised learning method for video anomaly detection called Patch Spatio-Temporal Relation Prediction (PSTRP). The key highlights are:
PSTRP introduces a two-stream vision transformer network to capture deep visual features of video frames, with the spatial and temporal streams modeling appearance and motion patterns, respectively.
The inter-patch relationship in each dimension is decoupled into inter-patch similarity and the order information of each patch. To mitigate memory consumption, the order information prediction task is converted into a multi-label learning problem, and the inter-patch similarity prediction task is formulated as a distance matrix regression problem.
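A minimal numpy sketch of how these two formulations might look. The function names, shapes, and encodings here are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def order_multilabel_targets(order, n_patches):
    # Assumed encoding: order prediction as multi-label learning, with one
    # binary label per (patch, position) pair instead of a full permutation
    # classifier, which keeps memory linear in the number of positions.
    # `order[i]` is the true position of patch i in the shuffled sequence.
    targets = np.zeros((n_patches, n_patches), dtype=np.float32)
    targets[np.arange(n_patches), order] = 1.0
    return targets

def distance_matrix_target(features):
    # Inter-patch similarity cast as a pairwise distance matrix to regress.
    # Euclidean distance is an assumption; the paper's metric may differ.
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))
```

The multi-label view avoids enumerating all `n!` permutations of patch orders, which is what makes the memory saving possible.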
Comprehensive experiments demonstrate the effectiveness of PSTRP, surpassing pixel-generation-based methods by a significant margin across three public benchmarks. PSTRP also outperforms other self-supervised learning-based methods.
The object extraction module is used to extract regions of interest (ROIs) from each frame, and the extracted spatio-temporal cubes (STCs) are spatially and temporally divided to generate inputs for the order prediction module.
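The spatial and temporal division of an STC can be sketched as follows. The 2x2 spatial grid, two temporal segments, and tensor layout are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def divide_stc(stc, grid=2, segments=2):
    # Split an object-centered spatio-temporal cube (STC) of shape
    # (T, H, W, C) into `segments` temporal chunks and a `grid` x `grid`
    # spatial grid, yielding segments * grid * grid patches.
    T, H, W, C = stc.shape
    patches = []
    for t in range(segments):
        for i in range(grid):
            for j in range(grid):
                patches.append(stc[t * T // segments:(t + 1) * T // segments,
                                   i * H // grid:(i + 1) * H // grid,
                                   j * W // grid:(j + 1) * W // grid])
    return patches
```

Each resulting patch is what the order prediction task shuffles and the similarity task compares.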
The distance constraint module is introduced to guide the encoder in learning accurate relations among patches, enabling the model to capture deep video features and spatio-temporal relations.
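One plausible form of such a distance constraint is a regression loss between the predicted and ground-truth inter-patch distance matrices; mean squared error is an assumption here, and the paper may use a different objective:

```python
import numpy as np

def distance_constraint_loss(pred_dist, target_dist):
    # Sketch of a distance-constraint objective: MSE between the distance
    # matrix predicted from encoder features and the ground-truth matrix.
    return float(np.mean((pred_dist - target_dist) ** 2))
```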
Ablation studies validate the effectiveness of the object optimization module and distance constraint module in improving the performance of video anomaly detection.
Key insights distilled from the paper by Hao Shen, Lu ... at arxiv.org, 03-29-2024.
https://arxiv.org/pdf/2403.19111.pdf