The study introduces a framework that improves news image captioning by focusing on entity recognition. It highlights the limitations of common multimodal large language models (MLLMs) on this task and reports experiments showing gains in CIDEr score and in entity generation. The proposed method aligns multimodal information and uses that alignment to refine the textual input context, which leads to better performance.
The study emphasizes the importance of handling entities in news image captioning and demonstrates the effectiveness of the proposed alignment framework. In experiments on two datasets, the authors report better results than existing models. The approach trains models on multiple tasks simultaneously and refines the textual input based on the aligned multimodal information.
Key points include:
Key insights extracted from
by Junzhe Zhang... at arxiv.org 03-01-2024
https://arxiv.org/pdf/2402.19404.pdf
Deeper Questions