The authors focus on the task of editing multimodal large language models (MLLMs), which is more complex than editing single-modal language models. They construct a new benchmark, MMEdit, to evaluate the reliability, locality, and generality of multimodal model editing approaches.
The MMEdit benchmark includes two subtasks: Editing Visual Question Answering (E-VQA) and Editing Image Captioning (E-IC). The authors follow single-modal model editing approaches to construct the datasets, extending the evaluation principles of reliability, locality, and generality to multimodal settings.
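To make the three evaluation principles concrete, here is a minimal Python sketch of how reliability, generality, and locality could be aggregated as exact-match scores over edit records. The field names (`pred_edit`, `pred_rephrase`, `pred_unrelated`, etc.) are hypothetical and not taken from the paper's released code.

```python
# Minimal sketch of the three MMEdit-style metrics, assuming each record already
# contains the edited model's predictions (all field names are hypothetical).

def exact_match(pred: str, target: str) -> float:
    return float(pred.strip().lower() == target.strip().lower())

def evaluate_edit(records):
    """Aggregate reliability, generality, and locality over a list of edit records."""
    metrics = {"reliability": [], "generality": [], "locality": []}
    for r in records:
        # Reliability: the edited model gives the new target on the edit prompt itself.
        metrics["reliability"].append(exact_match(r["pred_edit"], r["edit_target"]))
        # Generality: rephrased text or alternative image prompts also yield the new target.
        metrics["generality"].append(exact_match(r["pred_rephrase"], r["edit_target"]))
        # Locality: unrelated prompts keep the pre-edit model's original answers.
        metrics["locality"].append(exact_match(r["pred_unrelated"], r["orig_unrelated"]))
    return {k: sum(v) / len(v) for k, v in metrics.items()}

if __name__ == "__main__":
    demo = [{
        "pred_edit": "a red bus", "edit_target": "a red bus",
        "pred_rephrase": "a red bus",
        "pred_unrelated": "two cats", "orig_unrelated": "two cats",
    }]
    print(evaluate_edit(demo))  # {'reliability': 1.0, 'generality': 1.0, 'locality': 1.0}
```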
The authors evaluate several knowledge editing approaches on MMEdit and find that while current editing methods are effective for editing the textual model within the multimodal language model, they are far less effective for editing the vision module. For example, when editing the language module of BLIP-2, the reliability of MEND can reach 99.4%, but it attains only 65.2% when editing the vision module, highlighting both the difficulty of and the opportunities in this task.
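The sketch below illustrates what "editing the language module vs. the vision module" can mean in practice: selecting which submodule's parameters are trainable before applying a simple fine-tuning-style edit. It uses the Hugging Face `Blip2ForConditionalGeneration` module layout (`language_model`, `vision_model`), but it is a simplified illustration, not MEND or the paper's actual editing pipeline.

```python
# Hedged sketch: target either the language or the vision module of BLIP-2 for a
# simple fine-tuning-style edit (not MEND); only the module names follow the
# Hugging Face Blip2ForConditionalGeneration layout, the rest is illustrative.
import torch
from transformers import Blip2ForConditionalGeneration

def select_editable_params(model, target="language"):
    """Freeze all parameters, then unfreeze only the module chosen for editing."""
    for p in model.parameters():
        p.requires_grad = False
    module = model.language_model if target == "language" else model.vision_model
    editable = []
    for p in module.parameters():
        p.requires_grad = True
        editable.append(p)
    return editable

if __name__ == "__main__":
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
    params = select_editable_params(model, target="vision")
    optimizer = torch.optim.Adam(params, lr=1e-5)
    # ... run a few gradient steps on the edit example, then re-check
    # reliability / generality / locality as in the metric sketch above.
```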
The authors also analyze the impact of editing different components of the multimodal model and find that editing the vision module is more challenging than editing the language module. They argue that this difficulty may stem from the model's architecture, in which factual knowledge may be stored in separate parameters across modules, making the visual module harder to edit.
Source: Siyuan Cheng et al., https://arxiv.org/pdf/2310.08475.pdf