This paper proposes V2Xum-LLaMA, a cross-modal video summarization framework that unifies different video summarization tasks within the text decoder of a single large language model. The framework uses temporal prompts and task instructions to make video summarization task-controllable.
The core contribution of this work is twofold: it introduces a novel cross-modal video summarization task that generates semantically aligned video and text summaries from a long source video, and it establishes VideoXum, a large-scale benchmark dataset, to facilitate research in this emerging area.
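To make the idea of task-controllable summarization concrete, the sketch below shows one plausible way a task instruction and a temporal prompt could be composed into the input for an LLM text decoder. This is a minimal illustration, not the authors' actual code: the function name, task labels, and token format are all assumptions.

```python
def build_prompt(task: str, num_frames: int) -> str:
    """Compose a task-controllable prompt (illustrative sketch).

    task: 'V2V' (video summary), 'V2T' (text summary), or
          'V2VT' (aligned video + text summaries).
    num_frames: frames sampled from the source video; each frame gets a
    hypothetical temporal token like <3> so the decoder can reference
    frame indices when producing a video summary.
    """
    instructions = {
        "V2V": "Select the key frames that summarize the video.",
        "V2T": "Write a concise text summary of the video.",
        "V2VT": "Produce semantically aligned video and text summaries.",
    }
    if task not in instructions:
        raise ValueError(f"unknown task: {task}")
    # Temporal prompt: one index token per sampled frame.
    temporal_prompt = " ".join(f"<{i}>" for i in range(num_frames))
    return f"[INST] {instructions[task]} [/INST] Frames: {temporal_prompt}"
```

Switching the `task` argument changes only the instruction text, so one decoder can serve all three summarization tasks from the same frame representation.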