Core Concepts
AVicuna introduces a novel framework for audio-visual understanding that addresses the challenges of Temporal Referential Dialogue (TRD) and achieves state-of-the-art performance.
Abstract
The paper introduces AVicuna, a model focused on Temporal Referential Dialogue (TRD) in audio-visual media. To address the shortage of temporally labeled audio-visual data, the authors create the PU-VALOR and A5-222K datasets. AVicuna incorporates an Audio-Visual Token Interleaver (AVTI) and Context-Boundary Alignment to achieve fine-grained understanding and temporal synchrony. Experiments show superior performance across a range of video tasks.
Introduction:
- Humans naturally combine speech with gestures such as pointing in Referential Dialogue (RD).
- Extending RD to the video domain is challenging: references must be grounded in time, and untrimmed audio-visual data with temporal labels is scarce.
Data Extraction:
- "We propose a practical yet straightforward pipeline that leverages the VALOR-32K dataset with all trimmed videos to synthesize untrimmed videos with temporal labels."
Architecture of AVicuna:
- Multimodal encoders extract embeddings from the vision and audio modalities; the Audio-Visual Token Interleaver (AVTI) then interleaves the two token streams so the LLM receives them in temporal order (sketched below).
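The summary does not spell out the AVTI's mechanism; the following is a hedged sketch of one way to interleave two modality streams, assuming tokens are uniformly spaced over the same time range:

```python
# Illustrative interleaving in the spirit of the AVTI: merge video and audio
# token sequences by assumed uniform timestamps so the downstream LLM sees
# them in temporal order. Token contents here are placeholder strings.
def interleave(video_tokens: list, audio_tokens: list) -> list:
    """Merge two modality streams that span the same time range."""
    def stamped(tokens):
        n = len(tokens)
        # place each token at the center of its uniform time slot
        return [((i + 0.5) / n, tok) for i, tok in enumerate(tokens)]
    merged = sorted(stamped(video_tokens) + stamped(audio_tokens),
                    key=lambda pair: pair[0])
    return [tok for _, tok in merged]

print(interleave([f"v{i}" for i in range(4)], [f"a{i}" for i in range(2)]))
# ['v0', 'a0', 'v1', 'v2', 'a1', 'v3']
```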
Multi-stage Fine-tuning:
- Four critical stages: Vision-Text Alignment, Audio-Text Alignment, Context-Boundary Alignment, and Instruction Tuning.
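As an illustration of the schedule (the data-to-stage mapping and trainable-module names below are assumptions for illustration, not taken from the paper):

```python
# Hypothetical stage schedule for the four fine-tuning stages named above.
# Which modules train at each stage, and on which data, is assumed here.
STAGES = [
    {"name": "Vision-Text Alignment",      "data": "image/video-text pairs", "train": ["vision_adapter"]},
    {"name": "Audio-Text Alignment",       "data": "A5-222K",                "train": ["audio_adapter"]},
    {"name": "Context-Boundary Alignment", "data": "PU-VALOR",               "train": ["adapters", "llm"]},
    {"name": "Instruction Tuning",         "data": "instruction data",       "train": ["adapters", "llm"]},
]
for stage in STAGES:
    print(f"{stage['name']}: tune {stage['train']} on {stage['data']}")
```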
Experimental Results:
- AVicuna surpasses comparable models on Video QA and audio-visual QA (AVQA) benchmarks.
Ablation Study:
- Removing key components such as the AVTI, or withholding the new datasets, degrades performance.
Qualitative Analysis:
- Three examples demonstrate AVicuna's ability to accurately predict temporal intervals (see the scoring sketch below).
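To make this concrete, here is a hedged sketch of scoring such interval predictions with temporal IoU, the standard temporal-grounding metric; the answer phrasing parsed below is hypothetical, and the paper's exact output format may differ:

```python
# Parse a predicted interval out of a model answer and score it against the
# ground truth with temporal IoU. The answer format is a hypothetical example.
import re

def parse_interval(answer: str) -> tuple[float, float]:
    """Pull the first two numbers from an answer like 'from 12.3s to 45.6s'."""
    start, end = map(float, re.findall(r"\d+(?:\.\d+)?", answer)[:2])
    return start, end

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

pred = parse_interval("The dog barks from 12.3s to 45.6s.")
print(round(temporal_iou(pred, (10.0, 40.0)), 3))  # 0.778
```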
Stats
"We propose a practical yet straightforward pipeline that leverages the VALOR-32K dataset with all trimmed videos to synthesize untrimmed videos with temporal labels."
Quotes
"We introduce a novel framework to generate PU-VALOR, an extensive audio-visual dataset comprising over 114,000 untrimmed videos with accurate temporal demarcations."
"Our experiments demonstrate that AVicuna can effectively handle TRD in audio-visual videos and achieve state-of-the-art performance on various tasks."