Multimodal Transformer for Comics Text-Cloze: Enhancing Narrative Understanding in Comics Analysis
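The author introduces a Multimodal Large Language Model architecture tailored to the comics text-cloze task, in which a model must select the correct text for a panel whose text has been masked, given the neighboring panels as context. The model is reported to achieve significant improvements over existing models. The approach combines visual and textual elements to enhance narrative understanding in comics analysis.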
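To make the text-cloze setup concrete, the following is a minimal sketch of how such a task might be represented and scored. It is not the author's architecture: the names `ClozeExample`, `word_overlap_score`, `predict`, and `accuracy` are hypothetical, and the toy scorer uses text only, whereas the model described above also uses the panel images.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClozeExample:
    """One text-cloze item (hypothetical structure, for illustration only)."""
    context_texts: List[str]  # text from the panels that precede the masked one
    candidates: List[str]     # possible texts for the masked panel
    answer_index: int         # position of the correct candidate


Scorer = Callable[[List[str], str], float]


def word_overlap_score(context: List[str], candidate: str) -> float:
    """Toy text-only scorer (an assumption, standing in for a real model).

    Returns the fraction of the candidate's words that also occur in the context.
    """
    context_words = set(" ".join(context).lower().split())
    candidate_words = set(candidate.lower().split())
    if not candidate_words:
        return 0.0
    return len(context_words & candidate_words) / len(candidate_words)


def predict(example: ClozeExample, score: Scorer) -> int:
    """Return the index of the highest-scoring candidate (first one on ties)."""
    scores = [score(example.context_texts, c) for c in example.candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: List[ClozeExample], score: Scorer) -> float:
    """Fraction of examples where the predicted candidate is the correct one."""
    correct = sum(predict(ex, score) == ex.answer_index for ex in examples)
    return correct / len(examples)


example = ClozeExample(
    context_texts=["Where is the map?", "I hid the map in the cave."],
    candidates=["Let us eat lunch.", "Then we go to the cave!", "Nice weather today."],
    answer_index=1,
)
print(predict(example, word_overlap_score))
```

A multimodal model would replace `word_overlap_score` with a function that also takes the panel images into account; the selection and accuracy logic would stay the same under this framing.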