The paper introduces Multimodal Multimedia Event Argument Extraction with Unified Template Filling (MMUTF), a novel approach for the event argument extraction (EAE) task in a multimodal setting. The key highlights are:
MMUTF employs a unified template filling framework that uses event templates as natural language prompts to connect the textual and visual modalities. This enables cross-ontology transfer and the incorporation of event-specific semantics.
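To make this concrete, below is a minimal sketch of what such a template might look like. The event type and role names follow an ACE-style ontology as used by M2E2, but the exact prompt wording is an illustrative assumption, not the paper's actual template.

```python
# Illustrative sketch of an event template used as a natural language prompt.
# The prompt wording below is hypothetical; only the general structure
# (a slot-bearing sentence plus its argument roles) reflects the approach.
template = {
    "event_type": "Conflict.Attack",
    "prompt": "<attacker> attacked <target> using <instrument> at <place>",
    "roles": ["Attacker", "Target", "Instrument", "Place"],
}
```

Each role slot in the prompt gives rise to one argument role query, which is then matched against candidates as described next.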
The model encodes textual entities and visual objects as candidates and computes matching scores between these candidates and argument role queries derived from the event templates. This candidate-query matching forms the basis of the EAE task.
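A minimal sketch of this matching step is shown below, assuming scaled dot-product similarity between encoder outputs; the paper's actual encoders and scoring function are abstracted away here.

```python
import torch
import torch.nn.functional as F

def match_scores(candidates: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
    """Score each candidate (textual entity or visual object) against each
    argument role query.

    candidates: (num_candidates, d) encoded entity/object representations
    queries:    (num_roles, d)      encoded role queries from the event template
    returns:    (num_candidates, num_roles) matching scores
    """
    d = candidates.size(-1)
    # Scaled dot-product similarity; the real model may use a different scorer.
    return candidates @ queries.T / d ** 0.5

# Toy usage: 5 candidates, 3 argument roles, 256-dim embeddings.
cands = torch.randn(5, 256)
qrys = torch.randn(3, 256)
scores = match_scores(cands, qrys)
probs = F.softmax(scores, dim=-1)  # per-candidate distribution over roles
```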
Experiments on the M2E2 benchmark demonstrate the effectiveness of the proposed approach: MMUTF surpasses the previous state of the art on textual EAE by 7% F1 and generally outperforms the second-best systems on multimedia EAE.
The authors also analyze the transfer learning capabilities of MMUTF by leveraging the FrameNet and SWiG datasets, demonstrating strong zero-shot performance on the M2E2 benchmark.