Core Concepts
Generating accurate and character-aware audio descriptions for movies is a challenging task that requires fine-grained visual understanding and awareness of the characters and their names. This work proposes new datasets and architectures to advance the state of the art in this domain.
Summary
The paper makes three key contributions:
- It proposes two new datasets for training and evaluating movie audio description (AD) generation models:
  - CMD-AD: constructed by aligning audio descriptions from AudioVault with movie clips from the CMD dataset, providing video data aligned with AD annotations.
  - HowTo-AD: derived from the HowTo100M instructional video dataset by transforming the existing captions into pseudo-AD with character names.
- It develops two new architectures, Movie-BLIP2 and Movie-Llama2, that take raw video frames and character bank information as input and generate character-aware AD. Both models build on pre-trained visual and language models.
- It introduces new evaluation metrics tailored to AD generation:
  - CRITIC: measures the accuracy of character naming in the generated AD.
  - LLM-AD-eval: uses large language models to assess the overall semantic quality of the generated AD.
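To make the character-naming evaluation concrete, here is a minimal sketch of the kind of check such a metric performs. This is a hypothetical simplification: it reduces character naming to matching names from a character bank against the reference AD, whereas the paper's actual CRITIC metric is more involved (e.g. handling co-reference), so the function name and logic below are illustrative assumptions, not the paper's implementation.

```python
def character_name_score(predicted_ad: str, reference_ad: str, character_bank: list[str]) -> float:
    """Toy character-naming accuracy (hypothetical simplification of CRITIC):
    the fraction of character names mentioned in the reference AD that the
    generated AD also names correctly."""
    pred_names = {name for name in character_bank if name in predicted_ad}
    ref_names = {name for name in character_bank if name in reference_ad}
    if not ref_names:
        return 1.0  # no characters to name, so nothing can be missed
    return len(pred_names & ref_names) / len(ref_names)


# Example: both characters in the reference are named in the prediction.
bank = ["Jack", "Rose", "Cal"]
score = character_name_score(
    "Jack pulls Rose back from the railing.",
    "Jack grabs Rose as she leans over the railing.",
    bank,
)
```

A real metric would additionally need to resolve pronouns ("he", "she") to character identities before scoring, which is why simple string matching is only a rough proxy.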
The experiments show that the proposed datasets and architectures significantly outperform previous methods on both standard and new evaluation benchmarks for movie AD generation.
Statistics
"102 movies / videos" in MAD dataset
"105 movies / videos" in CMD-AD (ours)
"106 descriptions" in HowTo-AD (ours)
Quotes
"Cinema is a matter of what's in the frame and what's out." - Martin Scorsese