The paper proposes ALGO (Action Learning with Grounded Object recognition), a neuro-symbolic framework for open-world egocentric activity recognition. The key ideas are:
Evidence-based Object Grounding: ALGO uses a novel neuro-symbolic prompting approach to ground objects in the video, treating object-centric vision-language foundation models (such as CLIP) as a noisy oracle and reasoning over compositional object properties drawn from a commonsense knowledge base (ConceptNet).
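The grounding step above can be sketched as evidence fusion: a noisy score from the vision-language model is combined with how many of an object's KB-derived properties are confirmed in the frame. This is a minimal illustrative sketch, not the paper's actual scoring; all numbers and the fusion weight `alpha` are mocked assumptions.

```python
# Hypothetical per-frame similarity scores from a vision-language model
# (e.g. CLIP) for candidate object labels; values are illustrative.
clip_scores = {"cup": 0.62, "bowl": 0.58, "knife": 0.21}

# Illustrative evidence from a commonsense KB (e.g. ConceptNet): the fraction
# of an object's compositional properties (parts, affordances) that prompting
# the model confirms in the frame.
property_evidence = {"cup": 0.8, "bowl": 0.4, "knife": 0.1}

def ground_object(clip_scores, property_evidence, alpha=0.5):
    """Fuse noisy model scores with KB-derived evidence; return the best label."""
    fused = {
        label: alpha * clip_scores[label]
               + (1 - alpha) * property_evidence.get(label, 0.0)
        for label in clip_scores
    }
    return max(fused, key=fused.get), fused

label, fused = ground_object(clip_scores, property_evidence)
print(label)  # "cup": highest combined visual + property evidence
```

Treating the model as a *noisy* oracle is what motivates the second evidence term: a label that scores well visually but has no commonsense property support is down-weighted.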
Object-driven Activity Discovery: Driven by prior commonsense knowledge, ALGO discovers plausible activities (verb-noun combinations) through an energy-based symbolic pattern theory framework. It then learns to ground the inferred action (verb) concepts in the video.
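The discovery step can be illustrated as energy minimization over candidate verb-noun interpretations. This is a toy stand-in for the paper's pattern-theory energy: here the energy is simply the negative log of a KB-derived verb-noun compatibility, and the compatibility table is invented for illustration.

```python
import math

# Illustrative compatibility between verbs and grounded nouns, as might be
# distilled from ConceptNet relations such as UsedFor or CapableOf.
compat = {
    ("pour", "cup"): 0.9, ("cut", "cup"): 0.1,
    ("pour", "knife"): 0.05, ("cut", "knife"): 0.85,
}

def energy(verb, noun):
    """Lower energy = more plausible interpretation (negative log compatibility)."""
    return -math.log(compat.get((verb, noun), 1e-6))

def discover_activity(verbs, noun):
    """Return the verb forming the minimum-energy activity with the grounded noun."""
    return min(verbs, key=lambda v: energy(v, noun))

print(discover_activity(["pour", "cut"], "cup"))  # prints "pour"
```

Because the search is driven by the grounded noun, implausible combinations ("cut cup") receive high energy and are pruned before any action grounding in the video.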
Visual-Semantic Action Grounding: ALGO learns to ground the inferred actions from the activity interpretations by bootstrapping a simple mapping function from video features to the semantic embedding space of the commonsense knowledge base.
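A mapping of this kind can be sketched as a linear projection from video features into the semantic embedding space, fit by least squares on discovered (feature, verb-embedding) pairs and queried by nearest-neighbor cosine similarity. Everything below is synthetic: the embeddings, the feature generator `A`, and the linear-least-squares choice are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy semantic embeddings for two verb concepts (stand-ins for KB embeddings).
verb_emb = {"cut": rng.normal(size=4), "pour": rng.normal(size=4)}

# Synthetic video features: each clip's feature is a fixed linear "rendering"
# of its verb's embedding (purely illustrative).
A = rng.normal(size=(4, 8))
verbs = ["cut", "pour"] * 10
X = np.stack([verb_emb[v] @ A for v in verbs])   # video features (20 x 8)
Y = np.stack([verb_emb[v] for v in verbs])       # target embeddings (20 x 4)

# Bootstrap the grounding map W: video features -> semantic space.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def ground_action(feat):
    """Project a feature into semantic space; return the nearest verb by cosine."""
    z = feat @ W
    return max(verb_emb, key=lambda v: z @ verb_emb[v]
               / (np.linalg.norm(z) * np.linalg.norm(verb_emb[v]) + 1e-9))

print(ground_action(verb_emb["pour"] @ A))  # prints "pour" on this synthetic data
```

Grounding into the KB's embedding space, rather than a fixed label set, is what lets such a mapping score verbs never seen during training, since any verb with a semantic embedding can be ranked by the same similarity.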
The proposed approach is evaluated on two egocentric video datasets, GTEA Gaze and GTEA Gaze Plus, demonstrating open-world activity inference and generalization to unseen actions. ALGO is also competitive with state-of-the-art vision-language models in zero-shot settings.