The paper makes three key contributions:
It proposes two new datasets for training and evaluating movie audio description (AD) generation models.
It develops two new architectures, Movie-BLIP2 and Movie-Llama2, that take raw video frames and character bank information as input and generate character-aware AD. These models leverage pre-trained visual and language models (a rough sketch of this kind of wiring follows the list).
It introduces new evaluation metrics tailored for AD generation (a toy example of what such a metric could look like appears after the experiments summary below).
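Since the list describes the architecture only at a high level, here is a minimal, hypothetical PyTorch sketch of the wiring it implies: frozen visual features and a character bank are projected into a language model's token space and decoded into AD text. All module choices, names, and dimensions below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CharacterAwareAD(nn.Module):
    """Toy stand-in for a character-aware AD generator (illustrative only)."""

    def __init__(self, vis_dim=768, char_dim=512, lm_dim=1024, vocab_size=32000):
        super().__init__()
        # Placeholders for a frozen pre-trained visual encoder and language model;
        # in practice these would be e.g. a ViT backbone and BLIP-2 or Llama-2.
        self.visual_encoder = nn.Linear(vis_dim, vis_dim)
        self.vis_proj = nn.Linear(vis_dim, lm_dim)    # frame features -> LM token space
        self.char_proj = nn.Linear(char_dim, lm_dim)  # character-bank entries -> LM token space
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
        self.language_model = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, frames, char_bank):
        # frames:    (B, T, vis_dim)  features of the raw video frames
        # char_bank: (B, C, char_dim) embeddings for characters known to be in the clip
        vis_tokens = self.vis_proj(self.visual_encoder(frames))
        char_tokens = self.char_proj(char_bank)
        # Prepend character tokens so decoding is conditioned on who is on screen.
        tokens = torch.cat([char_tokens, vis_tokens], dim=1)
        hidden = self.language_model(tokens)
        return self.lm_head(hidden)  # per-position vocabulary logits


model = CharacterAwareAD()
logits = model(torch.randn(2, 8, 768), torch.randn(2, 3, 512))
print(logits.shape)  # torch.Size([2, 11, 32000])
```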
The experiments show that models trained with the proposed datasets and architectures significantly outperform previous methods on both standard and newly introduced evaluation benchmarks for movie AD generation.
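The summary does not spell out the new metrics, so the following is only a toy illustration of what an AD metric tailored to character naming could look like: it compares which character-bank names appear in the generated AD versus the reference AD. The function name and scoring rule are assumptions for illustration, not the paper's actual metrics.

```python
def character_f1(generated: str, reference: str, character_bank: list[str]) -> float:
    """F1 over character names mentioned in generated vs. reference AD."""
    gen_chars = {c for c in character_bank if c.lower() in generated.lower()}
    ref_chars = {c for c in character_bank if c.lower() in reference.lower()}
    if not gen_chars and not ref_chars:
        return 1.0  # neither mentions any character: trivially consistent
    if not gen_chars or not ref_chars:
        return 0.0  # one names characters, the other does not
    tp = len(gen_chars & ref_chars)
    if tp == 0:
        return 0.0
    precision = tp / len(gen_chars)
    recall = tp / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


print(character_f1("Rick hands Ilsa the letters.",
                   "Rick gives the letters of transit to Ilsa.",
                   ["Rick", "Ilsa", "Sam"]))  # 1.0
```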
Key insights taken from arxiv.org, by Teng..., 04-23-2024: https://arxiv.org/pdf/2404.14412.pdf