Key Concepts
The paper proposes the M3AV dataset for multimodal academic content recognition and understanding tasks.
Summary
The M3AV dataset is introduced as a novel resource for evaluating AI models in recognizing and understanding academic lectures. It covers various fields like computer science, mathematics, and biomedical science. The dataset includes annotated speech transcriptions, slide texts, and additional papers to facilitate research in multimodal content analysis. Various benchmark tasks are proposed to evaluate models' performance in contextual speech recognition, speech synthesis, slide generation, and script generation.
Introduction
- Open-source academic video recordings are popular for sharing knowledge online.
- Existing datasets lack high-quality human annotations, which hinders research on multimodal content recognition.
- The M3AV dataset aims to address this gap with annotated videos from diverse academic fields.
Data Extraction
- "M3AV serves as a benchmark to evaluate the ability to perform multimodal perception and academic knowledge comprehension."
- "Evaluations demonstrate that the diversity of M3AV makes it a challenging dataset."
Related Work
- Existing datasets focus on either recognizing multimodal content or understanding academic knowledge.
- M3AV combines both aspects by providing high-quality annotations for spoken and written words.
Data Creation Pipeline
- Videos collected from YouTube are annotated with speech transcriptions using expert ASR systems.
- Slide text is extracted using OCR techniques followed by manual corrections for complex content.
Benchmarks and Experiments
ASR and CASR Task
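- Benchmark systems for the ASR task include attention-based encoder-decoder (AED) and RNN transducer (RNN-T) models.
- For contextual ASR (CASR), TCPGen (tree-constrained pointer generator) with GNN tree encodings improves rare word recognition.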
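ASR benchmarks of this kind are commonly scored by word error rate (WER). This summary does not state which metric the paper uses, so the following is only a minimal sketch of the standard WER computation. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from M3AV.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (not part of M3AV): counts substitutions,
    deletions, and insertions needed to turn the reference into the
    hypothesis, using whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (excluding the current one) and the first j
    # hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences: one substituted word out of seven.
    ref = "the graph neural network encodes the tree"
    hyp = "the graph neural network encodes a tree"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Contextual ASR work often reports a second error rate restricted to the words in a biasing list; that variant is not shown here because its exact definition depends on the paper.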
Spontaneous TTS Task
- The MQTTS model outperforms Bark and SpeechT5 in generating natural conversational speech.
Slide and Script Generation Task
- LLaMA-2 + OCR performs well in generating scripts from slides.
- GPT-4V excels in both slide-to-script and script-to-slide generation tasks.
Conclusion & Limitations
The M3AV dataset offers valuable resources for AI research in multimodal content analysis. However, limitations exist regarding biases, domain coverage, visual information extraction, and external knowledge integration.