Core Concepts
AVicuna introduces a framework for audio-visual understanding that addresses the challenges of Temporal Referential Dialogue (TRD) and achieves state-of-the-art performance.
Abstract
AVicuna is a model focused on Temporal Referential Dialogue (TRD) in audio-visual media. The work introduces the PU-VALOR and A5-222K datasets to enhance audio-visual understanding, and details AVicuna's architecture, multi-stage fine-tuning process, experimental results, ablation studies, and potential impact.
Introduction
- Large Language Models (LLMs) have advanced natural language processing.
- Challenges remain in understanding fine-grained spatial and temporal details in multimodal input.
Method
PU-VALOR Dataset
- Creation of pseudo-untrimmed videos with event-level temporal annotations (see the sketch below).
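A minimal sketch of how such pseudo-untrimmed videos could be assembled: trimmed clips are concatenated into one longer video, and each clip's caption is grounded to its start/end span on the merged timeline. The `Clip` fields and the ordering strategy here are assumptions for illustration, not the authors' exact pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class Clip:
    """A trimmed source clip with its caption (field names are hypothetical)."""
    video_id: str
    duration: float  # seconds
    caption: str

def build_pseudo_untrimmed(clips: list[Clip]) -> dict:
    """Concatenate trimmed clips into one pseudo-untrimmed video and
    record each clip's start/end time as a temporal annotation."""
    random.shuffle(clips)  # randomize clip order in the merged video
    annotations, t = [], 0.0
    for clip in clips:
        annotations.append({
            "video_id": clip.video_id,
            "start": round(t, 2),                # caption is grounded to
            "end": round(t + clip.duration, 2),  # this [start, end] span
            "caption": clip.caption,
        })
        t += clip.duration
    return {"total_duration": round(t, 2), "events": annotations}

# Example: three short clips become one ~20 s pseudo-untrimmed video
sample = [Clip("v1", 6.0, "a dog barks"), Clip("v2", 8.5, "rain falls"),
          Clip("v3", 5.2, "a car passes")]
print(build_pseudo_untrimmed(sample))
```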
Architecture of AVicuna
- Multimodal Encoders, Connective Adapters, and the Audio-Visual Tokens Interleaver (AVTI) for temporal alignment (see the sketch below).
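The AVTI's role is to keep audio and visual tokens temporally aligned before they reach the LLM. Below is a minimal sketch of one plausible interleaving scheme, merging the two token streams in normalized time order; the paper's actual interleaver may differ in detail.

```python
import torch

def interleave_av_tokens(video_tokens: torch.Tensor,
                         audio_tokens: torch.Tensor) -> torch.Tensor:
    """Interleave per-timestep video and audio tokens so that tokens
    covering the same time span sit next to each other in the sequence.
    Shapes: video_tokens (T_v, D), audio_tokens (T_a, D); D must match."""
    T_v, T_a = video_tokens.size(0), audio_tokens.size(0)
    merged, i, j = [], 0, 0
    # Walk both streams by each token's normalized timestamp in [0, 1)
    while i < T_v or j < T_a:
        t_v = i / T_v if i < T_v else float("inf")
        t_a = j / T_a if j < T_a else float("inf")
        if t_v <= t_a:
            merged.append(video_tokens[i]); i += 1
        else:
            merged.append(audio_tokens[j]); j += 1
    return torch.stack(merged)  # (T_v + T_a, D)

# e.g. 8 video tokens and 4 audio tokens -> 12 time-ordered tokens
out = interleave_av_tokens(torch.randn(8, 256), torch.randn(4, 256))
print(out.shape)  # torch.Size([12, 256])
```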
Multi-stage Fine-tuning
- Vision/audio-text alignment, context-boundary alignment, and instruction tuning (see the staging sketch below).
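A hedged sketch of how such staged training might be wired: each stage unfreezes only a subset of modules and trains on stage-specific data. The module names, stage granularity, and LoRA assumption below are illustrative, not the authors' exact configuration.

```python
# Hypothetical stage schedule; module and dataset names are illustrative.
STAGES = [
    {"name": "vision/audio-text alignment",
     "train": ["visual_adapter", "audio_adapter"],  # connective adapters only
     "data": "caption pairs (e.g., A5-222K audio-text pairs)"},
    {"name": "context-boundary alignment",
     "train": ["visual_adapter", "audio_adapter", "avti"],
     "data": "PU-VALOR pseudo-untrimmed videos with timestamps"},
    {"name": "instruction tuning",
     "train": ["llm_lora"],  # LLM adapted last (LoRA is an assumption)
     "data": "instruction-following dialogues"},
]

def set_trainable(model, trainable_modules):
    """Freeze everything, then unfreeze only this stage's modules."""
    for name, param in model.named_parameters():
        param.requires_grad = any(m in name for m in trainable_modules)

# for stage in STAGES:
#     set_trainable(model, stage["train"])
#     train_one_stage(model, stage["data"])  # hypothetical training loop
```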
Experimental Results
Quantitative Experiments
- Comparison with existing models on video and audio-video understanding tasks.
Ablation Study
- Impact of individual components (AVTI, input modalities, training datasets) on performance.
Qualitative Analysis
- Examples showcasing AVicuna's accurate temporal understanding.
Discussion & Conclusion
- Limitations include hallucination and spatial comprehension issues.
Stats
The PU-VALOR dataset provides over 114k pseudo-untrimmed videos.
The A5-222K dataset contains over 220k audio-text pairs.
Quotes
"AVicuna achieves advantageous performance in various video and audio-video understanding tasks."
"AVicuna surpasses all other LLM-based models on both video QA and AVQA benchmarks."