Core Concepts
CausalChaos! is a challenging dataset for causal action question answering that requires models to comprehend longer causal chains, diverse reasoning types, and multi-level answers grounded in dynamic visual scenes of the Tom & Jerry cartoon series.
Abstract
The CausalChaos! dataset is constructed from the iconic Tom & Jerry cartoon series, which provides an abundance of dynamic visual scenes and cause-and-effect relationships. The dataset features the following key characteristics:
Multi-level answers: Each question is accompanied by a primary answer and a more detailed explanation, providing a richer and more nuanced perspective on the characters' actions and motivations.
Focus on causal reasoning in visually dynamic scenes: The dataset challenges models to comprehend the temporal and contextual flow of events amid rapid scene changes and dynamic interactions, requiring them to link multiple cues dispersed across different scenes.
Leveraging principles of animation: The cartoon visuals use principles such as timing, exaggeration, and staging to communicate cause and effect clearly, allowing models to focus on deciphering the causal relationships themselves.
Wide spectrum of reasoning types: The dataset demands various forms of reasoning, including deductive, inductive, spatial, causal, critical thinking, emotional, abductive, and temporal reasoning.
Character-centric questions: The dataset uses character names instead of generic nouns, necessitating character resolution and understanding character dynamics.
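To make the characteristics above concrete, the following is a minimal sketch of how one question-answer record with a multi-level answer might be represented. The class name, field names, and the `characters_mentioned` helper are hypothetical and chosen for illustration; they are not the dataset's actual schema or tooling. The reasoning types are the eight listed above.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class ReasoningType(Enum):
    """The eight reasoning types named in the dataset description."""
    DEDUCTIVE = "deductive"
    INDUCTIVE = "inductive"
    SPATIAL = "spatial"
    CAUSAL = "causal"
    CRITICAL_THINKING = "critical thinking"
    EMOTIONAL = "emotional"
    ABDUCTIVE = "abductive"
    TEMPORAL = "temporal"


@dataclass
class CausalQARecord:
    """Hypothetical record layout; the field names are assumptions."""
    clip_id: str
    question: str
    primary_answer: str   # first level: the short answer
    explanation: str      # second level: the more detailed explanation
    reasoning_types: list = field(default_factory=list)

    def full_answer(self) -> str:
        """Join both answer levels into one string."""
        return f"{self.primary_answer} {self.explanation}".strip()


def characters_mentioned(text, known=("Tom", "Jerry")):
    """Return the known character names that appear as whole words in text."""
    return [name for name in known if re.search(rf"\b{re.escape(name)}\b", text)]


# Placeholder values only; this is not an entry from the dataset.
record = CausalQARecord(
    clip_id="clip_0001",
    question="<question text>",
    primary_answer="<primary answer>",
    explanation="<detailed explanation>",
    reasoning_types=[ReasoningType.CAUSAL, ReasoningType.TEMPORAL],
)
```

Keeping the two answer levels in separate fields lets an evaluation score the primary answer and the explanation independently. The name matching here is a naive whole-word search, far short of the character resolution the dataset actually demands from models.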
The authors benchmark state-of-the-art VideoQA models on the dataset and find that, while the models perform reasonably well, there is significant room for improvement, particularly in causal relationship modeling and open-ended answer generation. They suggest that incorporating the dataset can benefit VideoQA models on real-world datasets, and that the unique properties of cartoons can inform model design to better address real-world challenges.
Stats
The average length of causal chains in CausalChaos! is 2.7, significantly longer than the average length of 1 in existing causal VideoQA datasets.
The dataset contains an average of 4 scene changes per video clip, challenging models to link context and cues across different scenes.
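As a rough illustration of how the two averages above could be computed from per-clip annotations, here is a small sketch. It assumes that a causal chain is stored as an ordered list of events and that its length is the number of cause-to-effect links between consecutive events; that definition, the field names, and the toy values are all assumptions made for this example, not figures or formats from the dataset.

```python
from statistics import mean


def chain_length(events):
    """Number of cause -> effect links in an ordered list of events (assumed definition)."""
    return max(len(events) - 1, 0)


def summarize(clips):
    """Average causal-chain length and scene changes over a list of clip annotations."""
    return {
        "avg_chain_length": mean(chain_length(c["events"]) for c in clips),
        "avg_scene_changes": mean(c["scene_changes"] for c in clips),
    }


# Toy annotations with made-up values, used only to exercise the functions.
toy_clips = [
    {"events": ["e1", "e2", "e3"], "scene_changes": 3},
    {"events": ["e1", "e2", "e3", "e4"], "scene_changes": 4},
    {"events": ["e1", "e2", "e3", "e4"], "scene_changes": 5},
]

stats = summarize(toy_clips)
```

Under this definition, a dataset in which every question links one cause to one effect has an average chain length of 1, which matches the baseline figure quoted above for existing causal VideoQA datasets.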
Quotes
"Jerry wanted to pull Tom's tail into another pocket."
"Jerry was laughing at Tom who fell down the stairs and into the fountain."