This paper proposes a unified template-filling framework that connects textual and visual modalities via natural language prompts to address the event argument extraction task.
This paper proposes a new paradigm called Open-vocabulary Multimodal Emotion Recognition (OV-MER) that enables the prediction of any number and category of emotions, advancing emotion recognition from basic to more nuanced emotions.
The addition of visual information, in various forms, improves both self-reported confidence and accuracy for next-word prediction in both humans and language models.
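The proposed method effectively integrates existing audio and video diffusion models and introduces new mechanisms for temporal adjustment and cross-modal conditioning, enabling it to generate high-quality, temporally coherent video with audio.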
Multimodal Prompt Tuning (MMPT) is a novel approach that effectively integrates visual and textual prompts into the vision encoder and language processor, respectively, to enable efficient and accurate multimodal adaptation for zero-shot instruction learning.
This paper presents a comprehensive survey on visual prompting methods in multimodal large language models (MLLMs), covering visual prompt generation, integration into MLLM perception and reasoning, and model alignment techniques.
Multimodal foundation models (MFMs) like LanguageBind and ImageBind outperform audio-only foundation models (AFMs) for non-verbal emotion recognition (NVER) tasks by better capturing subtle emotional cues through their joint pre-training across multiple modalities.
OmniBench is a novel benchmark designed to rigorously evaluate multimodal large language models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Current state-of-the-art large foundation models exhibit varying strengths and weaknesses in multimodal reasoning, with no single model outperforming the others across all tasks. Detailed evaluation reveals room for improvement in areas such as geometric reasoning, making effective use of multimodal input, and grounding information retrieval.
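ImageBind was used to generate multimodal embeddings that fuse the images and text of online auto parts advertisements, and the semantic quality of these embeddings was analyzed. The study also showed a correlation with pure audio embeddings, suggesting potential application areas for ImageBind.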