The key highlights and insights of this content are:
The authors propose a new framework for adverb-type recognition in video clips, which consists of three phases: Extraction, Reasoning, and Prediction.
In the Extraction phase, the framework extracts discrete object-behavior facts from raw video clips using a pipeline that detects objects, computes their optical flow, and represents the information in an Answer Set Programming (ASP) format.
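The extraction step can be pictured as a small conversion from per-frame detections and flow vectors into ASP facts. This is only a minimal sketch: the predicate names (`object/3`, `moves/3`) and the coarse flow discretisation are illustrative assumptions, not the paper's actual schema.

```python
def to_asp_facts(tracks):
    """Convert (frame, object_id, label, dx, dy) detections into ASP fact strings.

    The predicates and the left/right/static discretisation of the optical-flow
    vector are hypothetical stand-ins for the paper's representation.
    """
    facts = []
    for frame, obj_id, label, dx, dy in tracks:
        facts.append(f"object({obj_id}, {label}, {frame}).")
        # Discretise the horizontal flow component into a coarse direction symbol.
        direction = "right" if dx > 0 else "left" if dx < 0 else "static"
        facts.append(f"moves({obj_id}, {direction}, {frame}).")
    return facts

# Two frames of a tracked person moving right, then left.
facts = to_asp_facts([(0, 1, "person", 3.2, 0.1), (1, 1, "person", -1.0, 0.0)])
# facts[0] == "object(1, person, 0)."
```

Discrete facts like these are what make the downstream symbolic (FastLAS) and transformer-based reasoning phases possible.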
For the Reasoning phase, the authors explore two approaches: a single-step symbolic baseline that learns indicator rules with FastLAS, and a novel transformer-based method that performs masked language modeling over the extracted object-behavior facts to learn summary representations.
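The masking side of the transformer approach can be sketched as follows. This is a dependency-free illustration of the masked-language-modeling objective only: tokens in a fact sequence are randomly replaced with a `[MASK]` symbol, and the model (not shown) is trained to reconstruct them. The token vocabulary and masking rate here are assumptions, not the paper's settings.

```python
import random

def mask_fact_sequence(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask tokens of an object-behavior fact sequence.

    Returns (masked, targets): targets[i] holds the original token where a
    mask was placed, and None elsewhere (positions ignored in the loss).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)   # the transformer must reconstruct this token
        else:
            masked.append(tok)
            targets.append(None)  # not scored
    return masked, targets
```

A transformer trained on such masked sequences learns summary representations of object behavior that the prediction phase then consumes.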
In the Prediction phase, the framework uses the learned object-behavior representations, concatenated with action-type embeddings, to train separate SVM classifiers to distinguish between each adverb and its antonym.
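The prediction phase can be sketched as one binary classifier per adverb/antonym pair over features formed by concatenating the behavior summary with an action-type embedding. To keep the sketch dependency-free, a simple perceptron stands in for the SVM the paper uses (in practice one would reach for e.g. `sklearn.svm.SVC`); the feature values and the "quickly"/"slowly" pair below are illustrative.

```python
def train_perceptron(X, y, epochs=20, lr=0.1):
    """Stand-in linear classifier (the paper trains an SVM per adverb pair)."""
    w = [0.0] * (len(X[0]) + 1)  # weights followed by a bias term
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            pred = 1 if score > 0 else 0
            if pred != label:  # update only on mistakes
                for i, xi in enumerate(x):
                    w[i] += lr * (label - pred) * xi
                w[-1] += lr * (label - pred)
    return w

def predict(w, x):
    return 1 if w[-1] + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Concatenate each clip's behavior summary with its action-type embedding.
summaries = [[1.0, 0.2], [0.9, 0.1], [-0.8, -0.3], [-1.1, 0.0]]
action_embs = [[0.5], [0.4], [0.5], [0.6]]
X = [s + a for s, a in zip(summaries, action_embs)]
y = [1, 1, 0, 0]  # 1 = "quickly", 0 = its antonym "slowly" (illustrative)

w = train_perceptron(X, y)  # one such classifier is trained per adverb pair
```

Framing each adverb against its antonym reduces multi-class adverb recognition to a set of simpler binary decisions.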
The authors release two new datasets, MSR-VTT-ASP and ActivityNet-ASP, which contain the extracted object-behavior facts and adverb annotations for subsets of the MSR-VTT and ActivityNet video datasets.
Experimental results show that the transformer-based reasoning approach outperforms previous state-of-the-art methods on both MSR-VTT-ASP and ActivityNet-ASP, demonstrating the effectiveness of reasoning over object behaviors for adverb-type recognition.
Key insights from the paper by Amrit Diggav... at arxiv.org, 03-29-2024
https://arxiv.org/pdf/2307.04132.pdf