The paper introduces a baseline method and an experimental protocol for detecting discrepancies between the audio and visual modalities in multimedia content. The authors first design and optimize an audio-visual scene classifier, which is then applied to the audio and visual streams of a clip separately so that the resulting per-modality scene predictions can be compared to identify inconsistencies between them.
To facilitate further research and provide a common evaluation platform, the authors introduce an experimental protocol and a benchmark dataset simulating such inconsistencies. The dataset, called VADD, is created by swapping the audio and video streams for half of the videos in the existing TAU dataset, while keeping the other half unchanged.
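For concreteness, the stream-swapping construction can be pictured as follows. This is a minimal sketch, assuming a directory of demuxed MP4 clips and ffmpeg for remuxing; the file layout, naming, and rotation-based audio reassignment are illustrative, not the authors' actual pipeline.

```python
"""Sketch of a VADD-style split: half the clips keep their own audio,
the other half get audio borrowed from a different clip."""
import random
import subprocess
from pathlib import Path

def build_mismatched_split(video_dir: str, out_dir: str, seed: int = 0) -> None:
    rng = random.Random(seed)
    clips = sorted(Path(video_dir).glob("*.mp4"))
    rng.shuffle(clips)
    half = len(clips) // 2
    to_swap, matched = clips[:half], clips[half:]

    # Rotate the audio sources by one position so no clip keeps its own audio.
    audio_sources = to_swap[1:] + to_swap[:1]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for video_src, audio_src in zip(to_swap, audio_sources):
        subprocess.run(
            ["ffmpeg", "-y",
             "-i", str(video_src),   # video stream from one clip
             "-i", str(audio_src),   # audio stream from another
             "-map", "0:v:0", "-map", "1:a:0",
             "-c", "copy", "-shortest",
             str(out / f"mismatched_{video_src.name}")],
            check=True,
        )
    for clip in matched:  # the untouched half keeps its original audio
        (out / f"matched_{clip.name}").write_bytes(clip.read_bytes())
```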
The proposed baseline method achieves state-of-the-art results in scene classification on the TAU dataset and promising results in audio-visual discrepancy detection on the VADD dataset. The authors highlight the potential of their approach in content verification applications.
The key steps of the proposed method are:
1. Train and optimize an audio-visual scene classifier on matched audio-video data.
2. Apply the classifier to the audio and visual streams of a clip separately to obtain per-modality scene predictions.
3. Compare the two predictions and flag the clip as inconsistent when they disagree (a minimal sketch of this comparison follows the list).
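Here is a minimal sketch of the comparison step, under the assumption that the classifier exposes separate audio and visual branches; `audio_branch` and `visual_branch` are hypothetical stand-ins for the paper's model, and the argmax-disagreement rule is the simplest possible decision rule, not necessarily the authors' exact criterion.

```python
"""Flag a clip as inconsistent when its two modalities predict
different scene classes."""
import torch

@torch.no_grad()
def detect_discrepancy(audio_feats: torch.Tensor,
                       visual_feats: torch.Tensor,
                       audio_branch, visual_branch) -> bool:
    audio_probs = torch.softmax(audio_branch(audio_feats), dim=-1)
    visual_probs = torch.softmax(visual_branch(visual_feats), dim=-1)
    # Simplest decision rule: compare the argmax scene predictions.
    return audio_probs.argmax(-1).item() != visual_probs.argmax(-1).item()
```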
The authors also conduct an ablation study to analyze the impact of different design choices, such as the placement of self-attention layers and the use of data augmentation, on scene classification performance.
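To illustrate what a self-attention placement ablation might look like in code, here is a hypothetical sketch in which the position of a self-attention layer inside a small convolutional backbone is a constructor argument; the backbone, channel sizes, and the `attn_after` knob are assumptions for illustration, not the authors' architecture.

```python
"""Toy backbone where the self-attention layer's position is an
ablation parameter (attn_after = index of the conv block it follows)."""
import torch
import torch.nn as nn

class ConvAttnBackbone(nn.Module):
    def __init__(self, n_classes: int = 10, attn_after: int = 2):
        super().__init__()
        channels = [3, 32, 64, 128]
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                          nn.ReLU())
            for c_in, c_out in zip(channels, channels[1:])
        )
        self.attn_after = attn_after
        # Embedding width must match the channel count at the insertion point.
        self.attn = nn.MultiheadAttention(channels[attn_after + 1],
                                          num_heads=4, batch_first=True)
        self.head = nn.Linear(channels[-1], n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == self.attn_after:
                b, c, h, w = x.shape
                seq = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
                seq, _ = self.attn(seq, seq, seq)    # self-attention over positions
                x = seq.transpose(1, 2).reshape(b, c, h, w)
        return self.head(x.mean(dim=(2, 3)))         # global average pool
```

Sweeping `attn_after` over the block indices (and training with and without augmentation) is the kind of grid such an ablation study would cover.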