Core Concepts
FLOCODE combines flow-aware temporal consistency, correlation debiasing, label-correlation modeling, and uncertainty attenuation to generate unbiased dynamic scene graphs from videos.
Abstract
The paper proposes FLOCODE, a method for generating unbiased dynamic scene graphs from videos. It tackles two key challenges in dynamic scene graph generation: bias in the predicted scene graphs and the long-tailed distribution of visual relationships.
Key highlights:
Temporal Flow-Aware Object Detection (TFoD) leverages flow-warped features to ensure temporal consistency in object detection across video frames.
Correlation-Aware Predicate Embedding models spatial, temporal, and predicate-object correlations using a Transformer encoder-decoder architecture.
Debiased Predicate Embedding updates the correlation matrices as a weighted average to generate debiased predicate embeddings.
Uncertainty-Aware Mixture of Attenuated Loss (L_MAL) and Uncertainty-Aware Supervised Contrastive Learning (L_MCL) handle noisy annotations and capture label correlations, respectively.
Extensive experiments on the Action Genome benchmark demonstrate significant performance improvements over state-of-the-art methods, with gains of up to 4.1% in mean-Recall@K.
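The flow-warped features behind TFoD can be illustrated with a small sketch: features computed on the previous frame are resampled along the optical flow, so the warped map aligns with the current frame and detections stay temporally consistent. This is an illustrative NumPy implementation under assumed conventions (pixel-space flow, bilinear resampling), not the paper's code; the function name and shapes are hypothetical.

```python
import numpy as np

def warp_features(prev_feat: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a (C, H, W) feature map by a (2, H, W) pixel-space flow (bilinear).

    flow[0] is the x-displacement, flow[1] the y-displacement; sample
    locations are clamped to the feature-map border.
    """
    c, h, w = prev_feat.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Locations in the previous frame to sample from.
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)
    # Integer corners and bilinear weights.
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = sx - x0, sy - y0
    return (prev_feat[:, y0, x0] * (1 - wx) * (1 - wy)
            + prev_feat[:, y0, x1] * wx * (1 - wy)
            + prev_feat[:, y1, x0] * (1 - wx) * wy
            + prev_feat[:, y1, x1] * wx * wy)

# Sanity check: a zero flow leaves the features unchanged.
feat = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
warped = warp_features(feat, np.zeros((2, 4, 4)))
```

In a detection pipeline, the warped map would typically be fused (e.g., averaged or gated) with the current frame's features before the detection head.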
The proposed framework, FLOCODE, offers a robust solution for capturing accurate scene representations in dynamic environments by addressing key challenges in dynamic scene graph generation.
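The debiasing step described above — updating correlation matrices as a weighted average and using them to produce debiased predicate embeddings — can be sketched as a running (exponential) average plus a log-prior correction on the predicate logits. This is a hedged approximation of the described mechanism, not the paper's implementation; the function names, the momentum value, and the marginal-prior correction are all assumptions for illustration.

```python
import numpy as np

def update_correlation(corr: np.ndarray, batch_corr: np.ndarray,
                       momentum: float = 0.9) -> np.ndarray:
    """Weighted-average update of a predicate correlation matrix.

    Keeps a running estimate: old matrix weighted by `momentum`,
    current-batch co-occurrence statistics by (1 - momentum).
    """
    return momentum * corr + (1.0 - momentum) * batch_corr

def debias_logits(logits: np.ndarray, corr: np.ndarray,
                  alpha: float = 1.0) -> np.ndarray:
    """Subtract a log-frequency prior derived from the correlation marginals,
    down-weighting head predicates (hypothetical debiasing rule)."""
    prior = corr.sum(axis=0) / corr.sum()
    return logits - alpha * np.log(prior + 1e-8)

# Example: the running matrix drifts toward the batch statistics.
corr = update_correlation(np.ones((3, 3)), np.zeros((3, 3)), momentum=0.9)
```

A momentum-style average like this keeps the correlation estimate stable across batches while still adapting to the long-tailed predicate distribution.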
Stats
Beyond the comparative performance metrics reported on the Action Genome benchmark, the paper provides no additional standalone numerical statistics.
Quotes
No direct quotes from the paper are provided to support the key claims.