This paper develops models that detect rhetorical and psychological persuasion techniques embedded in memes, leveraging both textual and visual modalities. The authors introduce an intermediate step of generating meme captions to bridge the textual and visual components, which improves model performance.
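The caption-bridging idea can be sketched as follows. This is a minimal, hypothetical illustration: the function names, the canned caption, and the label set are assumptions for demonstration, not the authors' actual models or taxonomy.

```python
# Hypothetical sketch of caption-bridging for persuasion-technique detection.
# Both functions below are illustrative stubs, not the paper's implementation.

def generate_caption(image_id: str) -> str:
    # Stand-in for an image-captioning model; here a canned caption
    # keyed by image id, purely for illustration.
    return {"meme_01": "a politician shown as a cartoon villain"}.get(image_id, "")

def detect_techniques(meme_text: str, caption: str) -> list:
    # Stand-in for a text classifier over the fused input. The key idea:
    # the caption verbalizes the visual content, so a text-only model
    # can reason jointly over both modalities.
    fused = f"{meme_text} [SEP] {caption}"
    techniques = []
    if "villain" in fused:
        techniques.append("Demonizing")  # illustrative label only
    return techniques

print(detect_techniques("They are ruining the country!",
                        generate_caption("meme_01")))
```

In a real system, `generate_caption` would be a vision-language model and `detect_techniques` a multi-label classifier over the concatenated text.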
RiVEG is a unified framework that reformulates Grounded Multimodal Named Entity Recognition (GMNER) as a joint Multimodal Named Entity Recognition (MNER), Visual Entailment (VE), and Visual Grounding (VG) task, using large language models (LLMs) as a bridge between the stages to address the limitations of existing GMNER methods.
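The staged decomposition can be sketched as below. Every function is a hypothetical stub standing in for one stage of the pipeline; the names, entity, and bounding box are invented for illustration and do not reflect the paper's actual models or data.

```python
# Hypothetical sketch of a staged GMNER pipeline: MNER -> LLM bridge -> VE -> VG.
# All stubs are illustrative assumptions, not RiVEG's implementation.

def mner(text: str) -> list:
    # Stage 1 (MNER): extract (entity, type) pairs from the multimodal text.
    return [("LeBron James", "PER")] if "LeBron" in text else []

def llm_bridge(entity: str, etype: str) -> str:
    # LLM bridge: expand the named entity into a natural-language
    # referring expression that downstream vision models can consume.
    return f"the person named {entity}"

def visual_entailment(image_id: str, expression: str) -> bool:
    # Stage 2 (VE): decide whether the expression is groundable in the image.
    return image_id == "img_01"

def visual_grounding(image_id: str, expression: str):
    # Stage 3 (VG): return a bounding box (x1, y1, x2, y2) for the expression.
    return (10, 20, 110, 220)

def gmner(text: str, image_id: str) -> list:
    # Full pipeline: ungroundable entities get a None box.
    results = []
    for entity, etype in mner(text):
        expr = llm_bridge(entity, etype)
        box = visual_grounding(image_id, expr) if visual_entailment(image_id, expr) else None
        results.append((entity, etype, box))
    return results

print(gmner("LeBron dunks again", "img_01"))
```

The design point this sketch illustrates is that VE acts as a filter before VG, so the grounding stage only runs on entities the image actually depicts.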