Core Concepts
This paper introduces a novel approach to generating accurate and contextually relevant soccer game commentary by addressing the crucial issue of temporal misalignment between video footage and textual descriptions in existing datasets.
Statistics
The temporal discrepancy between textual commentary and visual content in the existing benchmark can exceed 100 seconds.
Only 26.29% of the data falls within a 10-second window around the key frames in the original dataset.
The proposed approach reduces the average absolute offset by 7.0 seconds.
Nearly all (98.17%) textual commentaries align within a 60-second window surrounding the key frames after alignment.
The proportion of commentary that aligns within a precise 10-second window increases dramatically by 45.41% after alignment.
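The two kinds of figures quoted above (average absolute offset and the share of commentaries inside a time window) can be illustrated with a minimal sketch. The function names and the sample offsets are hypothetical, and the code assumes that a "w-second window around the key frame" means an offset of at most w/2 seconds on either side; the summary does not state how the paper defines the window, so treat that as an assumption.

```python
from statistics import mean


def mean_abs_offset(offsets):
    """Average absolute offset (seconds) between commentary timestamps and key frames."""
    return mean(abs(o) for o in offsets)


def window_coverage(offsets, window_s):
    """Percentage of commentaries whose offset falls inside a window of
    window_s seconds centred on the key frame (assumed: |offset| <= window_s / 2)."""
    half = window_s / 2
    hits = sum(1 for o in offsets if abs(o) <= half)
    return 100.0 * hits / len(offsets)


# Hypothetical offsets in seconds (commentary time minus key-frame time);
# these are illustrative values, not data from the paper.
offsets = [-3.0, 2.0, 12.0, -45.0, 110.0, 4.0, -8.0, 0.0]

print(mean_abs_offset(offsets))      # 23.0
print(window_coverage(offsets, 10))  # 50.0
print(window_coverage(offsets, 60))  # 75.0
```

Running the same two functions on offsets measured before and after alignment would give the kind of before/after comparison reported in the statistics.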
Quotes
"This paper aims to develop a high-quality, automatic soccer commentary system."
"Through manual annotation, we find that the temporal discrepancy between the textual commentary and the visual content in the existing benchmark can even exceed 100 seconds."
"Our alignment pipeline enables to significantly mitigate the temporal offsets between the visual content and textual commentaries, resulting in a higher-quality soccer game commentary dataset, named MatchTime."