Extracting and Reasoning over Object Behaviors to Recognize Adverb Types in Video Clips
This work presents a framework that extracts object-behavior facts from video clips, reasons over those facts using transformers, and predicts the adverb types that best describe each clip's overall content.
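The three stages named above (fact extraction, transformer-style reasoning, adverb-type prediction) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the `BehaviorFact` record, the placeholder labels in `ADVERB_TYPES`, the seeded random embeddings, and the single-head attention with identity projections standing in for a trained transformer. The weights are untrained, so the scores carry no meaning; only the data flow is shown.

```python
import math
import random
from dataclasses import dataclass

DIM = 8  # toy embedding width (assumption)
ADVERB_TYPES = ["type_a", "type_b", "type_c"]  # placeholder labels, not from the source


@dataclass(frozen=True)
class BehaviorFact:
    """Hypothetical record: an object shows a behavior over a frame span."""
    obj: str
    behavior: str
    start_frame: int
    end_frame: int


def embed(token: str) -> list[float]:
    # Deterministic pseudo-embedding: seed a private RNG with the token text.
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def embed_fact(fact: BehaviorFact) -> list[float]:
    # Sum of object and behavior embeddings; the frame span is ignored here.
    return [o + b for o, b in zip(embed(fact.obj), embed(fact.behavior))]


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: list[list[float]]) -> list[list[float]]:
    # Single-head scaled dot-product attention with identity Q/K/V projections.
    scale = math.sqrt(DIM)
    out = []
    for q in vectors:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        )
        out.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(DIM)]
        )
    return out


def predict_adverb_types(facts: list[BehaviorFact]) -> dict[str, float]:
    # Reason over all facts jointly, mean-pool, then score each label
    # independently with a sigmoid (multi-label setup, untrained weights).
    attended = self_attention([embed_fact(f) for f in facts])
    pooled = [sum(v[i] for v in attended) / len(attended) for i in range(DIM)]
    scores = {}
    for label in ADVERB_TYPES:
        weights = embed("classifier:" + label)
        logit = sum(w * p for w, p in zip(weights, pooled))
        scores[label] = 1.0 / (1.0 + math.exp(-logit))
    return scores


example_facts = [
    BehaviorFact("hand", "moves", 0, 12),
    BehaviorFact("cup", "tilts", 5, 20),
]
example_scores = predict_adverb_types(example_facts)
```

Calling `predict_adverb_types(example_facts)` returns one score in the range 0 to 1 per placeholder label. In a real system, a trained transformer encoder and a learned classification head would replace the seeded random vectors, and the fact extractor would presumably draw on object detections and tracks rather than hand-written records.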