Coarse Correspondences, a simple visual prompting method using object tracking, significantly improves spatial-temporal reasoning in multimodal language models without requiring architectural changes or task-specific fine-tuning.
Integrating a visual sketchpad with drawing tools into multimodal language models significantly improves their reasoning abilities in both mathematical and visual domains, enabling them to solve complex problems by generating and interpreting visual representations.
By identifying and leveraging "visual anchors" – key points of visual information aggregation within image data – the Anchor Former (AcFormer) offers a more efficient and accurate approach to connecting visual data with large language models.
Reka Core, Flash, and Edge are a series of powerful multimodal language models developed by Reka that can process and reason with text, images, video, and audio inputs, outperforming many larger models on a range of language and vision tasks.
Intermediate layers of Multimodal Large Language Models encode more global semantic information than the topmost layers.
MiniGPT-5, a new model, introduces "generative vokens" to unify image and text generation, demonstrating consistent improvements across diverse benchmarks.
An exploration of methods for developing multimodal LLMs for low-resource languages.