Core Concepts
This work explores how to leverage large language models and prompt engineering techniques to generate accurate text descriptions of comic strip content, including panel layouts, characters, dialogues, and visual elements. The goal is to provide visually impaired readers with a comprehensive textual representation that can be easily converted to audio or braille formats, enhancing the accessibility of comics.
Abstract
The paper explores methods to make visual arts, specifically comics, more accessible to blind and low-vision readers. The key insights are:
Current experimental tools for making comics accessible often lack comprehensive image and text descriptions, which are essential for visually impaired readers to fully understand and appreciate the content.
The authors propose a pipeline that combines computer vision techniques and large language models to generate detailed textual descriptions of comic strip content, including panel layouts, characters, dialogues, and visual elements.
The process extracts information from the comic strip images: panels, characters, text, reading order, and the association between speech bubbles and characters. This grounded context is then used to fine-tune large language models and generate accurate panel descriptions, character identification, and a comprehensive comic book script.
The generated script includes text type classification (sound, caption, dialogue), automatic character name inference, and contextual panel descriptions. This enriched content can be easily converted to audiobook or e-book formats with various voices for characters, captions, and sound effects.
The authors evaluate their approach on public domain English comics, demonstrating the effectiveness of the text type classification, character clustering, and name inference. They also present qualitative results on the generated panel descriptions, showcasing the level of detail and accuracy achieved.
The proposed method aims to address the key challenge of providing comprehensive image-related information, which has been identified as a bottleneck for making visual arts accessible at scale.
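The extraction and script-generation steps summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names (Box, Panel, reading_order, nearest_speaker, build_script) are hypothetical, the bounding boxes are hand-written rather than produced by a detector, the speaker of a dialogue block is guessed with a simple nearest-centre heuristic, and the large language model step that writes panel descriptions is left out entirely.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import hypot


@dataclass
class Box:
    """Axis-aligned bounding box in page pixel coordinates (assumed layout)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self) -> tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


@dataclass
class Character:
    name: str  # placeholder for an inferred character name
    box: Box


@dataclass
class TextBlock:
    text: str
    kind: str  # one of "dialogue", "caption", "sound"
    box: Box


@dataclass
class Panel:
    box: Box
    characters: list[Character] = field(default_factory=list)
    texts: list[TextBlock] = field(default_factory=list)


def reading_order(panels: list[Panel], row_height: float = 50) -> list[Panel]:
    """Simplified left-to-right, top-to-bottom order (an assumption of this sketch).

    Panels whose top edges fall in the same row_height band count as one row.
    """
    return sorted(panels, key=lambda p: (p.box.y // row_height, p.box.x))


def nearest_speaker(block: TextBlock, characters: list[Character]) -> Character | None:
    """Heuristic stand-in for bubble-to-character association: closest box centre."""
    if not characters:
        return None
    bx, by = block.box.center
    return min(
        characters,
        key=lambda c: hypot(c.box.center[0] - bx, c.box.center[1] - by),
    )


def build_script(panels: list[Panel]) -> str:
    """Emit one script line per text block, labelled by its text type."""
    lines: list[str] = []
    for index, panel in enumerate(reading_order(panels), start=1):
        lines.append(f"PANEL {index}")
        for block in sorted(panel.texts, key=lambda t: (t.box.y, t.box.x)):
            if block.kind == "dialogue":
                speaker = nearest_speaker(block, panel.characters)
                label = speaker.name if speaker else "UNKNOWN"
            elif block.kind == "caption":
                label = "CAPTION"
            else:
                label = "SFX"
            lines.append(f"{label}: {block.text}")
    return "\n".join(lines)


# Hand-made example page; panels are deliberately listed out of reading order.
right = Panel(
    Box(300, 0, 280, 200),
    characters=[
        Character("Character A", Box(320, 60, 80, 130)),
        Character("Character B", Box(480, 60, 80, 130)),
    ],
    texts=[TextBlock("Run!", "dialogue", Box(470, 10, 90, 40))],
)
left = Panel(
    Box(0, 0, 280, 200),
    texts=[
        TextBlock("Later that night...", "caption", Box(10, 10, 200, 30)),
        TextBlock("BANG", "sound", Box(100, 100, 80, 40)),
    ],
)
script = build_script([right, left])
print(script)
```

Keeping the text type on every line is what would let a downstream text-to-speech step assign different voices to characters, captions, and sound effects. In a fuller pipeline, the same structured context would also be passed to the language model as grounding for each panel description.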
Stats
The "Escape with me" episode has 45 panels, 109 text blocks, and 75 character instances.
The "Patents" comic book has 12 pages, 85 panels, 132 text blocks, and 77 character instances.
Quotes
"Visual details and contexts which are essential to understand and feel the beauty of these artworks are often missing in current experimental tools."
"The generated description is intended to mirror the image content by providing an overall scene description with named characters, dialogues and interaction following the natural reading order."