AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Core Concepts
The AV-SUPERB benchmark evaluates audio-visual representation models across a variety of tasks and datasets, highlighting that no current model achieves universal performance.
The content introduces the AV-SUPERB benchmark, emphasizing the importance of evaluating audio-visual representation models. It discusses the limitations of current models in generalizing across tasks and presents insights from evaluating self-supervised models on different datasets. The benchmark comprises three tracks for assessing audio-only, video-only, and audio-visual fusion representations. Results show that existing models excel at specific tasks but struggle to generalize across all of them. The study also explores the impact of intermediate-task fine-tuning on model performance and analyzes layer-wise contributions to task performance.

Section overview:

Abstract: Proposes the AV-SUPERB benchmark for evaluating audio-visual representation models; highlights the limitations of current models in generalizing across tasks.

Introduction: Discusses the goal of emulating human cognition through multitasking algorithms.

Benchmark Details: Introduces three evaluation tracks for assessing different types of representations.

Experimental Results and Discussion: Evaluates model performance across various speech and audio processing tasks.

When does Visual Grounding Improve Audio Representation Learning?: Compares HuBERT and AV-HuBERT results to analyze the impact of visual grounding on representations.

Layer-wise Contribution Analysis: Analyzes layer utilization in different models for various tasks.

How Does Intermediate-task Fine-tuning Affect Performance?: Explores the impact of intermediate-task fine-tuning on model performance.
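The layer-wise contribution analysis mentioned above typically relies on the SUPERB-style probing setup: downstream heads consume a learnable weighted sum of all hidden layers of the frozen upstream model, and the trained weights reveal which layers a task draws on. A minimal sketch in PyTorch (the layer count and dimensions here are hypothetical, not the AV-SUPERB implementation):

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Combine frozen upstream layer outputs via learned softmax weights.

    After downstream training, inspecting the normalized weights shows
    which layers contribute most to the task at hand.
    """
    def __init__(self, num_layers: int):
        super().__init__()
        # One scalar weight per layer; softmax keeps them on a simplex.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch, time, dim) tensors, one per layer.
        stacked = torch.stack(hidden_states, dim=0)   # (L, B, T, D)
        norm_w = torch.softmax(self.weights, dim=0)   # (L,)
        return (norm_w.view(-1, 1, 1, 1) * stacked).sum(dim=0)

# Example with 12 hypothetical upstream layers:
mixer = LayerWeightedSum(num_layers=12)
feats = [torch.randn(2, 50, 768) for _ in range(12)]
fused = mixer(feats)   # shape (2, 50, 768), fed to the downstream head
```

Because the weights are trained jointly with the downstream head while the upstream model stays frozen, the resulting distribution is a cheap per-task probe of layer utility.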

Key Insights Distilled From

by Yuan Tseng, ... at 03-20-2024

Deeper Inquiries

How can the AV-SUPERB benchmark be expanded to include more diverse tasks relevant to audio-visual processing?

Expanding the AV-SUPERB benchmark to encompass a broader range of tasks in audio-visual processing can be achieved through several strategies. Firstly, incorporating tasks such as cross-modal retrieval, audio-visual localization, and sound/video generation would enhance the diversity of challenges presented by the benchmark. These additional tasks would provide a more comprehensive evaluation of models' abilities to handle various aspects of audio and visual information integration.

Furthermore, including datasets that cover domains and modalities beyond speech and basic audio processing could broaden the scope of evaluation. For instance, adding datasets related to music analysis, environmental sound recognition, or multimodal sentiment analysis could offer a more holistic assessment of model performance across different real-world applications.

Additionally, introducing tasks that require complex interactions between auditory and visual cues, such as understanding spatial relationships in videos or identifying emotions from combined audio-visual inputs, would further test the robustness and generalization capabilities of representation models. By continuously updating and expanding the task repertoire within AV-SUPERB with new datasets representing diverse challenges in audio-visual processing, researchers can ensure that models are thoroughly evaluated on a wide spectrum of scenarios.

What are the implications of not achieving universal model performance across all tasks as highlighted by the study?

The inability to achieve universal model performance across all tasks has significant implications for research in audio-visual representation learning. One key implication is that existing models may lack sufficient generalization capabilities when faced with diverse sets of challenges across domains or modalities. This limitation hinders their applicability in real-world scenarios where multiple types of information must be processed simultaneously.

Moreover, if models exhibit task-specific strengths but struggle with other tasks within an integrated framework like the one AV-SUPERB proposes, it indicates a gap in current approaches to developing versatile representations capable of handling varied input types effectively. This highlights the need for further research into model architectures and training methodologies that improve overall adaptability and flexibility.

From an application perspective, the lack of universal model performance limits practical use cases where seamless integration of auditory and visual information is crucial. Tasks requiring multimodal understanding or cross-domain correlations may produce suboptimal results if models cannot generalize well across contexts. Overall, the challenge posed by varying performance across tasks underscores the importance of advancing toward more robust and versatile audio-visual representation learning frameworks that perform consistently across a wide array of applications.

How can intermediate-task fine-tuning be optimized to enhance overall model performance effectively?

Optimizing intermediate-task fine-tuning presents an opportunity to boost overall model performance by leveraging additional data sources strategically. To make this process effective:

1. Task Selection: Carefully select intermediate tasks that complement the target objectives. Tasks that share underlying features or require similar representations facilitate smoother transfer learning without shifting the model away from its primary goals.

2. Diverse Data Sources: Use varied datasets during intermediate-task fine-tuning to expose models to different contexts and help them capture nuanced patterns. Incorporating data spanning multiple domains provides a comprehensive learning experience leading up to target-task optimization.

3. Regularization Techniques: Apply regularization during fine-tuning to prevent overfitting to the nuances of the intermediate task while preserving generalizability for downstream applications.

4. Layer-wise Analysis: Conduct layer-wise analysis after fine-tuning to identify which layers contribute most to improved performance on both the intermediate and final objectives.

5. Hyperparameter Tuning: Tune hyperparameters separately for each stage (intermediate and target) to maximize effectiveness without compromising stability throughout training.

6. Transfer Learning Strategies: Employ advanced transfer learning techniques, such as progressive unfreezing or gradual adaptation from pre-trained weights toward specialized parameters, based on individual task requirements.

By implementing these strategies thoughtfully while considering dataset characteristics, model architecture complexity, and end goals, intermediate-task fine-tuning schemes can be made more effective, improving overall model performance across diverse tasks and domains.
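The progressive-unfreezing strategy mentioned above can be sketched in a few lines of PyTorch. This is an illustrative scheme, not the paper's implementation; the model's `blocks` attribute and the per-stage schedule are assumptions:

```python
import torch.nn as nn

def progressively_unfreeze(model: nn.Module, epoch: int, epochs_per_stage: int = 2):
    """Unfreeze one more trailing block every `epochs_per_stage` epochs.

    Assumes `model.blocks` is an nn.ModuleList ordered from input to
    output; blocks closest to the output (most task-specific) are
    unfrozen first, earlier blocks later.
    """
    num_blocks = len(model.blocks)
    # Number of blocks, counted from the output end, to train this epoch.
    num_trainable = min(num_blocks, epoch // epochs_per_stage + 1)
    for i, block in enumerate(model.blocks):
        requires_grad = i >= num_blocks - num_trainable
        for p in block.parameters():
            p.requires_grad = requires_grad

# Toy usage with a hypothetical 4-block model:
model = nn.Module()
model.blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
progressively_unfreeze(model, epoch=0)  # only the last block trains
```

Calling this at the start of each epoch gradually widens the trainable portion of the network, which limits catastrophic forgetting of the pre-trained representation during the intermediate-task stage.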