Discovering Novel Actions in Egocentric Videos through Object-Grounded Visual Commonsense Reasoning
A neuro-symbolic framework that leverages object-centric vision-language models and commonsense knowledge to discover novel actions and infer activities in egocentric videos without explicit supervision.