Neural-Symbolic Video Question Answering: Enabling Compositional Spatio-Temporal Reasoning for Real-World Videos
This paper proposes Neural-Symbolic VideoQA (NS-VideoQA), a neural-symbolic framework for compositional spatio-temporal reasoning in real-world video question answering. The framework first transforms unstructured video data into a symbolic representation that captures persons, objects, relations, and action chronologies. It then performs iterative symbolic reasoning over that representation to answer compositional questions.
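
To make the two stages concrete, the following is a minimal Python sketch of the general idea, not the paper's implementation. It assumes the perception stage has already produced per-frame relation triples and a list of timed actions; here those facts are written by hand. All names (`Relation`, `Action`, `SymbolicVideo`, `filter_frames`, `first_action_after`, `answer_what_after`) and the toy data are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Relation:
    """One (subject, predicate, object) fact observed in a single frame."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Action:
    """An action with the frame indices at which it starts and ends."""
    name: str
    start: int
    end: int


@dataclass
class SymbolicVideo:
    """Hypothetical symbolic representation: per-frame relations plus an action chronology."""
    frames: List[Set[Relation]]
    actions: List[Action]


def filter_frames(video: SymbolicVideo, predicate: str, obj: str) -> List[int]:
    """Reasoning step 1: indices of frames where the person has `predicate` with `obj`."""
    target = Relation("person", predicate, obj)
    return [i for i, rels in enumerate(video.frames) if target in rels]


def first_action_after(video: SymbolicVideo, frame_idx: int) -> Optional[str]:
    """Reasoning step 2: name of the earliest action that starts after `frame_idx`."""
    later = [a for a in video.actions if a.start > frame_idx]
    return min(later, key=lambda a: a.start).name if later else None


def answer_what_after(video: SymbolicVideo, predicate: str, obj: str) -> Optional[str]:
    """Compose the two steps to answer 'What did the person do after <predicate> the <obj>?'."""
    hits = filter_frames(video, predicate, obj)
    if not hits:
        return None  # the queried relation never occurs in the video
    return first_action_after(video, max(hits))


# Toy, hand-written facts standing in for the output of a neural perception stage.
video = SymbolicVideo(
    frames=[
        {Relation("person", "holding", "cup")},
        {Relation("person", "holding", "cup")},
        {Relation("person", "sitting_on", "sofa")},
        {Relation("person", "sitting_on", "sofa")},
    ],
    actions=[Action("drink from cup", 0, 1), Action("sit on sofa", 2, 3)],
)

answer = answer_what_after(video, "holding", "cup")
print(answer)  # sit on sofa
```

The question is answered by chaining small symbolic operations (filter frames by a relation, then look up the next action in the chronology), which is the compositional pattern the summary describes. A real system would replace the hand-written facts with detector outputs and support a larger set of operations.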