The study explores closure in comics through the text-cloze task, introducing a Multimodal Large Language Model architecture designed specifically for it. By fine-tuning a ResNet-50 visual encoder to the comics domain and releasing new OCR annotations, the model improves on the previous state of the art by 10%. The work also extends the task to a generative format, opening new possibilities for comics analysis.
Earlier methods based on recurrent neural networks struggled with limited OCR accuracy and model capacity. Introducing a Domain-Adapted ResNet-50 visual encoder, together with the new OCR annotations, significantly improves performance. The study focuses on tasks such as text-cloze, visual-cloze, and character-coherence to probe the relationship between text and images in comics.
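To make the task concrete, here is a minimal sketch of what a text-cloze sample looks like: given a few context panels (image plus OCR'd dialogue) and the image of a panel whose text is masked, the model must pick the correct dialogue from a small set of candidates. The class and field names below are illustrative assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Panel:
    """One comic panel: its image (here just a path) and OCR'd dialogue."""
    image_path: str
    ocr_text: str

@dataclass
class TextClozeSample:
    """Text-cloze: given context panels and the answer panel's image,
    choose which candidate text belongs in the masked answer panel."""
    context: List[Panel]   # preceding panels with their text visible
    answer_image: str      # image of the panel whose text is masked
    candidates: List[str]  # small set of candidate dialogues (e.g. 3)
    label: int             # index of the correct candidate

# Hypothetical example sample (panel paths and dialogues are made up).
sample = TextClozeSample(
    context=[Panel("p1.png", "Look out!"), Panel("p2.png", "Behind you!")],
    answer_image="p3.png",
    candidates=["Too late...", "What a lovely day.", "Pass the salt."],
    label=0,
)
print(sample.candidates[sample.label])  # the dialogue the model must select
```

Visual-cloze and character-coherence follow the same pattern, swapping which modality (or which speaker attribution) is masked.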
Recent multimodal large language models handle long-range dependencies across both textual and visual data effectively. The study applies the VL-T5 model to the text-cloze task in comics, demonstrating its effectiveness at integrating visual and textual information.
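The core idea of VL-T5-style integration is that visual features are embedded and placed in the same input sequence as the text tokens, so the encoder attends over both modalities jointly. The sketch below is a toy, stdlib-only illustration of that sequence construction; dimensions, the embedding table, and function names are assumptions for illustration, not the model's actual implementation.

```python
import random

def build_multimodal_sequence(visual_feats, text_ids, embed_text):
    """VL-T5-style input: visual region embeddings are prepended to the
    embedded text tokens, forming one sequence for a joint encoder."""
    text_embs = [embed_text(t) for t in text_ids]
    return visual_feats + text_embs  # length = n_regions + n_tokens

random.seed(0)
D = 8  # toy embedding dimension (assumption)
embed_table = {t: [random.random() for _ in range(D)] for t in range(100)}

visual = [[0.0] * D for _ in range(4)]  # e.g. 4 panel-region features
tokens = [5, 17, 42]                    # e.g. 3 subword token ids
seq = build_multimodal_sequence(visual, tokens, lambda t: embed_table[t])
print(len(seq))  # 4 visual positions + 3 text positions = 7
```

In the real model the visual features would come from the domain-adapted ResNet-50 encoder rather than zero vectors, but the joint-sequence structure is the same.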
Key insights from the original content by Emanuele Viv... at arxiv.org, 03-07-2024
https://arxiv.org/pdf/2403.03719.pdf