Generating Character-Aware Audio Descriptions for Movies from Pixels
Generating accurate, character-aware audio descriptions for movies is challenging: it requires both fine-grained visual understanding and awareness of the characters and their names. This work proposes new datasets and architectures to advance the state of the art in this domain.