Neuro-Symbolic Video Search: Enhancing Scene Identification with Temporal Logic Reasoning
Core Concepts
Decoupling semantic understanding and temporal reasoning is essential for efficient scene identification.
Summary
The surge in video data production requires tools for efficient frame extraction.
State-of-the-art models struggle with long-term reasoning because they intertwine perception and temporal reasoning.
Proposal of a system that uses vision-language models for per-frame semantic understanding and temporal logic (TL) for reasoning about how events unfold over time.
TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks.
Implementation details provided on the NSVS-TL pipeline.
Dataset compilation and ground truth specifications explained.
Evaluation metrics, results, and comparison with LLM-based reasoning presented.
Introduction
The surge in video data production demands efficient tools for frame extraction.
Key Insights
State-of-the-art models struggle with long-term reasoning due to intertwined perception and reasoning.
Proposal of a system leveraging vision-language models and temporal logic for improved event identification.
Methodology
The NSVS-TL framework separates temporal reasoning from perception, which makes scene identification more efficient.
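To make the separation concrete, here is a minimal sketch of a decoupled pipeline, not the authors' implementation. It assumes a perception model has already produced per-frame confidences for each proposition (the scores below are made up), thresholds them into boolean traces, and then runs a hand-written check for the temporal pattern "A until B". The names `to_truth_trace`, `find_until_scenes`, and the 0.5 threshold are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

# Per-frame perception output: proposition name -> confidence in [0, 1].
# In a real system these would come from a vision model; here they are invented.
FrameScores = Dict[str, float]


def to_truth_trace(scores: List[FrameScores], prop: str,
                   threshold: float = 0.5) -> List[bool]:
    """Threshold per-frame confidences into a boolean trace for one proposition."""
    return [frame.get(prop, 0.0) >= threshold for frame in scores]


def find_until_scenes(a: List[bool], b: List[bool]) -> List[Tuple[int, int]]:
    """Find (start, end) frame pairs where `a` holds on every frame from
    start up to end - 1 and `b` holds at end (a simplified "a until b").
    Scenes with no preceding `a` frame are skipped."""
    scenes = []
    start = None  # first frame of the current unbroken run of `a`
    for t in range(len(a)):
        if b[t] and start is not None:
            scenes.append((start, t))
            start = None
        if a[t]:
            if start is None:
                start = t
        else:
            start = None
    return scenes


# Perception step (stubbed with fixed scores).
video = [
    {"car": 0.1},
    {"car": 0.9},
    {"car": 0.8},
    {"car": 0.2, "pedestrian": 0.9},
    {"car": 0.7},
    {},
    {"pedestrian": 0.95},
]

# Reasoning step: operates only on the symbolic traces, never on pixels.
car = to_truth_trace(video, "car")
pedestrian = to_truth_trace(video, "pedestrian")
scenes = find_until_scenes(car, pedestrian)
print(scenes)
```

Because the reasoning step only sees boolean traces, the detector can be swapped without touching the temporal check, which is the point of decoupling the two.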
Datasets
Introduction of synthetic TLV datasets created from COCO and ImageNet images.
Annotation of autonomous vehicle datasets Waymo and NuScenes with TL specifications.
Results
Impact of neural perception models on NSVS-TL performance evaluated across various datasets.
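For reference, the F1 score used to compare systems can be computed from the sets of predicted and ground-truth frames. The sketch below assumes frame-level matching, which is an assumption here rather than a detail given in this summary; `frame_f1` is a hypothetical helper name.

```python
from typing import Iterable


def frame_f1(predicted: Iterable[int], ground_truth: Iterable[int]) -> float:
    """F1 score over frame indices: harmonic mean of precision and recall."""
    pred, truth = set(predicted), set(ground_truth)
    true_positives = len(pred & truth)
    if true_positives == 0:
        # Covers empty inputs and zero overlap without dividing by zero.
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Predicted frames 1-4 against ground-truth frames 2-5:
# 3 shared frames, so precision = recall = 0.75.
score = frame_f1(range(1, 5), range(2, 6))
print(score)
```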
Conclusion
NSVS-TL enhances video understanding through integration of semantic understanding with temporal reasoning.
Stats
Long-term temporal reasoning is a key desideratum for frame retrieval systems.
The proposed system improves the F1 score by 9-15% compared to benchmarks that use GPT-4 for reasoning, on self-driving datasets such as Waymo and NuScenes.
State-of-the-art computer vision models such as YOLOv8, Grounding DINO, Mask R-CNN, and CLIP are used in the evaluation process.
NSVS-TL maintains consistent performance even on videos of up to 40 minutes (2,400 seconds).
Quotes
"Decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification."
"Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks."
"NSVS-TL maintains consistent performance throughout different video lengths."