Coarse Correspondences, a simple visual prompting method using object tracking, significantly improves spatial-temporal reasoning in multimodal language models without requiring architectural changes or task-specific fine-tuning.
Integrating a visual sketchpad with drawing tools into multimodal language models significantly improves their reasoning abilities in both mathematical and visual domains, enabling them to solve complex problems by generating and interpreting visual representations.
By identifying and leveraging "visual anchors" – key points of visual information aggregation within image data – the Anchor Former (AcFormer) offers a more efficient and accurate approach to connecting visual data with large language models.
Reka Core, Flash, and Edge are a series of powerful multimodal language models developed by Reka that can process and reason with text, images, video, and audio inputs, outperforming many larger models on a range of language and vision tasks.
Intermediate layers of Multimodal Large Language Models encode more global semantic information than the topmost layers.
MiniGPT-5, a new model, introduces "generative vokens" to unify image and text generation, demonstrating consistent improvements across diverse benchmarks.
An exploration of methods for developing multimodal LLMs for low-resource languages.