The paper introduces MMUTF (Multimodal Multimedia Event Argument Extraction with Unified Template Filling), a novel approach to event argument extraction (EAE) in a multimodal setting. The key highlights are:
MMUTF employs a unified template filling framework that uses event templates as natural language prompts to connect the textual and visual modalities. This allows the model to exploit cross-ontology transfer and to incorporate event-specific semantics.
The model encodes textual entities and visual objects as candidates, and computes matching scores between the candidates and argument role queries extracted from the event templates. This candidate-query matching forms the basis for the EAE task.
Experiments on the M2E2 benchmark demonstrate the effectiveness of the proposed approach. MMUTF surpasses the current state-of-the-art on textual EAE by 7% F1 and generally performs better than the second-best systems for multimedia EAE.
The authors also analyze the transfer learning capabilities of MMUTF by leveraging the FrameNet and SWiG datasets, and report remarkable zero-shot performance on the M2E2 benchmark.
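
To make the candidate-query matching described above more concrete, here is a minimal sketch. It is not the authors' implementation: the `<role>` placeholder syntax, the toy hand-written vectors (standing in for learned encoder outputs), the sigmoid over a dot product, and the 0.5 threshold are all assumptions chosen for illustration. It only shows the general shape of the idea: role queries are read off an event template, and each textual entity or visual object is scored against each role.

```python
import math
import re

# Hypothetical event template; the <role> slot syntax is an assumption.
TEMPLATE = "<attacker> attacked <target> using <instrument> at <place>"

# Toy query vectors, one per role (stand-ins for learned query embeddings).
QUERIES = {
    "attacker": [1.0, 0.0, 0.0, 0.0],
    "target": [0.0, 1.0, 0.0, 0.0],
    "instrument": [0.0, 0.0, 1.0, 0.0],
    "place": [0.0, 0.0, 0.0, 1.0],
}

# Toy candidate vectors: one textual entity and one visual object.
CANDIDATES = {
    "soldiers": [2.0, -2.0, -2.0, -2.0],        # textual entity
    "bounding_box_0": [-2.0, -2.0, 2.0, -2.0],  # visual object
}


def extract_role_queries(template):
    """Return the role names marked as <role> slots in the template."""
    return re.findall(r"<(\w+)>", template)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def match_scores(candidates, queries):
    """Score every (candidate, role) pair with a sigmoid over a dot product."""
    scores = {}
    for name, cand_vec in candidates.items():
        for role, query_vec in queries.items():
            dot = sum(c * q for c, q in zip(cand_vec, query_vec))
            scores[(name, role)] = sigmoid(dot)
    return scores


def assign_roles(scores, threshold=0.5):
    """Keep the (candidate, role) pairs whose score exceeds the threshold."""
    return sorted(pair for pair, score in scores.items() if score > threshold)


if __name__ == "__main__":
    roles = extract_role_queries(TEMPLATE)
    scores = match_scores(CANDIDATES, {r: QUERIES[r] for r in roles})
    print(assign_roles(scores))
```

Because both modalities are reduced to candidate vectors scored against the same role queries, the same matching step could in principle serve text and images alike, which is the property the summary attributes to the unified template.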
Key insights distilled from
by Philipp Seeb... at arxiv.org, 10-03-2024
https://arxiv.org/pdf/2406.12420.pdf