Core Concepts
The work curates a dataset of 10,136 user-generated funny short-form videos from YouTube, annotated with timestamps and explanations of the humorous moments, and develops a zero-shot video-to-text prompting approach that improves language models' ability to explain video humor.
Abstract
The authors curate the ExFunTube dataset, which consists of 10,136 user-generated funny short-form videos from YouTube. Each video is annotated with timestamps and text explanations of the funny moments. The dataset aims to assess how well AI models can understand and explain video humor.
The authors develop a four-step video filtering pipeline to ensure that the collected videos exhibit multimodal humor, with both verbal and visual elements contributing to the humor. The pipeline uses GPT-3.5 to verify the presence of humor in the videos.
The authors then explore explaining video humor by converting the video content into fine-grained text and leveraging large language models (LLMs) in a zero-shot manner. They design a zero-shot video-to-text prompting method that extracts information from the visual, speech, and sound modalities of each video and arranges it chronologically as a text prompt for the LLMs.
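The chronological arrangement described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` type, the `build_prompt` function, the prompt layout, and the closing question are all hypothetical, and the speech and sound entries are invented purely for illustration. It assumes that each modality has already been converted to timestamped text by some upstream model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One timestamped piece of text extracted from a video (hypothetical type)."""
    start: float   # start time in seconds
    end: float     # end time in seconds
    modality: str  # "visual", "speech", or "sound"
    text: str


def build_prompt(segments, question="Explain why this video is funny."):
    """Merge segments from all modalities into one chronologically ordered prompt."""
    # Sort by start time (ties broken by end time) so the LLM reads events in order.
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    lines = [
        f"[{s.start:.1f}s-{s.end:.1f}s] ({s.modality}) {s.text}"
        for s in ordered
    ]
    # A blank line separates the video description from the instruction.
    return "\n".join(lines + ["", question])


# Illustrative input: only the visual caption is taken from the text;
# the speech and sound entries are made up for this example.
segments = [
    Segment(2.0, 4.0, "speech", "Do you want a snack?"),
    Segment(0.0, 3.0, "visual", "A person feeds a flower to a white husky."),
    Segment(3.5, 4.5, "sound", "Crunching noise."),
]

prompt = build_prompt(segments)
print(prompt)
```

The resulting string could then be passed as-is to a text-only LLM, which is what makes the approach zero-shot with respect to video: the model never sees frames or audio, only their textual description in temporal order.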
The authors evaluate their approach in three ways: model-based automatic scores, rationale quality experiments, and human evaluations. The results show that the prompting approach significantly improves the humor explanation performance of three LLMs (zero-shot GPT-3.5 and finetuned T5 and BART) compared to text-only baselines and other multimodal approaches.
The authors also analyze the LLMs' performance across different humor categories and find that their prompting approach is particularly effective in explaining humor that heavily relies on visual elements, such as clownish humor, visual gags, and slapstick.
Stats
"A white dog with blue eyes being fed some kind of flower."
"A person feeds a flower to a white husky."
"The husky dog eats a flower with his paw."
"A person holding out his thumb to a husky."