Key Concepts
The authors propose an Entity-Aware Multimodal Alignment Framework that improves news image captioning by addressing the difficulty multimodal large language models (MLLMs) have with entity recognition.
Summary
The study introduces a framework that improves news image captioning by focusing on entity recognition. It highlights the limitations of common MLLMs and reports experiments with higher CIDEr scores and better entity generation. The method aligns multimodal information and uses that alignment to refine the textual input context.
The study emphasizes the importance of handling entities in news image captioning and demonstrates the effectiveness of the proposed alignment framework. In experiments on two datasets, GoodNews and NYTimes800k, the method outperforms existing models. The approach trains models on multiple tasks simultaneously and refines the textual input using the aligned multimodal information.
Key points include:
- Introduction of an Entity-Aware Multimodal Alignment Framework for news image captioning.
- Challenges with entity recognition in MLLMs for news image captioning.
- Experiments showing improved results in CIDEr score and entity generation.
- Importance of aligning multimodal information to refine textual input context.
- Superior performance demonstrated through experiments on two datasets.
Statistics
Common MLLMs generate entities poorly in a zero-shot setting.
The proposed method outperforms previous state-of-the-art models in CIDEr score: 72.33 -> 86.29 on the GoodNews dataset and 70.83 -> 85.61 on the NYTimes800k dataset.
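To put the CIDEr figures quoted above in perspective, the short sketch below computes the absolute and relative gain on each dataset. The scores are the ones stated in this summary; the dictionary layout and the helper name `cider_gain` are illustrative assumptions, not something taken from the paper.

```python
# CIDEr scores as quoted in this summary: (previous state of the art, proposed method).
cider_scores = {
    "GoodNews": (72.33, 86.29),
    "NYTimes800k": (70.83, 85.61),
}


def cider_gain(previous, proposed):
    """Hypothetical helper: return (absolute gain in points, relative gain in percent)."""
    absolute = proposed - previous
    relative = 100.0 * absolute / previous
    return round(absolute, 2), round(relative, 1)


gains = {name: cider_gain(*pair) for name, pair in cider_scores.items()}

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.2f} CIDEr ({relative:.1f}% relative)")
```

The gains work out to 13.96 points on GoodNews and 14.78 points on NYTimes800k, roughly a one-fifth relative improvement over the previous best score on each dataset.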
Quotes
"Our method achieves better results than previous state-of-the-art models."
"MLLMs are more powerful models but struggle with entity recognition."