The study addresses the text-cloze task in comics, a closure-style task in which a model must supply the correct text for a masked panel given the surrounding panels, and introduces a novel Multimodal Large Language Model architecture designed for it. By adapting a ResNet-50 visual encoder to the comics domain and releasing new OCR annotations, the model achieves a 10% improvement over existing state-of-the-art models. The research also extends the task to a generative format, paving the way for new possibilities in comics analysis.
Traditional methods based on recurrent neural networks struggled with limited OCR accuracy and constrained model capacity. The introduction of a domain-adapted ResNet-50 visual encoder, together with the new OCR annotations, significantly improves model performance. The study focuses on tasks such as text-cloze, visual-cloze, and character-coherence, which probe the relationship between text and images in comics.
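To make the task concrete, here is a minimal sketch of text-cloze framed as multiple-choice selection. The data layout and the overlap-based scorer are illustrative assumptions, not the paper's method: a real system would score candidates with a multimodal model that also consumes panel images.

```python
from dataclasses import dataclass

@dataclass
class TextClozeInstance:
    """One text-cloze example: OCR text from context panels,
    candidate texts for the masked panel, and the gold answer."""
    context_texts: list[str]  # dialogue of the preceding panels
    candidates: list[str]     # candidate texts for the masked panel
    answer_index: int         # index of the correct candidate

def score_candidate(context_texts: list[str], candidate: str) -> float:
    """Placeholder scorer: simple word overlap with the context,
    purely for illustration of the task interface."""
    context_words = set(" ".join(context_texts).lower().split())
    cand_words = set(candidate.lower().split())
    if not cand_words:
        return 0.0
    return len(context_words & cand_words) / len(cand_words)

def predict(instance: TextClozeInstance) -> int:
    """Return the index of the highest-scoring candidate."""
    scores = [score_candidate(instance.context_texts, c)
              for c in instance.candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

Visual-cloze and character-coherence follow the same multiple-choice shape, but the candidates are panel images or speaker orderings instead of texts.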
Recent advancements in multimodal large language models have transformed text processing by effectively handling long-range dependencies across textual and visual data. The study applies the VL-T5 model to the text-cloze task in comics, showing its effectiveness in integrating visual and textual information.
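In the generative extension, the model produces the missing text rather than ranking candidates. A minimal sketch of how such an input might be serialized for a T5-style seq2seq model is below; the prompt layout is an assumption for illustration, though `<extra_id_0>` is T5's standard sentinel token for a masked span (visual features would be fed to the encoder separately).

```python
def build_generative_prompt(context_texts: list[str]) -> str:
    """Serialize context panel texts into a fill-in prompt; a
    seq2seq model (e.g. VL-T5, alongside panel image features)
    would then generate the masked panel's text."""
    lines = [f"panel {i + 1}: {text}"
             for i, text in enumerate(context_texts)]
    lines.append("masked panel: <extra_id_0>")
    return "\n".join(lines)
```

Framing the task generatively removes the dependence on pre-defined answer sets, which is what opens the door to open-ended comics analysis.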