Heron-Bench is a novel benchmark for assessing the Japanese-language capabilities of Vision Language Models (VLMs). It consists of a diverse set of image-question-answer pairs tailored to the Japanese context, enabling a comprehensive and culturally aware evaluation of VLMs.
The OmniFusion model integrates a pretrained large language model with specialized adapters for processing visual information, enabling superior performance on a range of visual-language benchmarks compared to existing open-source solutions.
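As a rough illustration of how such adapter modules typically work (a generic sketch, not OmniFusion's actual architecture; all dimensions and names here are assumptions), a small MLP can project frozen vision-encoder features into the LLM's token-embedding space:

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Two-layer MLP projecting vision-encoder features into the LLM's
    token-embedding space (all dimensions are illustrative assumptions)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen
        # vision encoder; the output acts as pseudo-token embeddings that
        # the language model consumes alongside ordinary text tokens
        return self.proj(image_features)

# Example: 256 patch features become 256 pseudo-token embeddings.
adapter = VisualAdapter()
visual_tokens = adapter(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```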
VisualWebBench is a comprehensive multimodal benchmark designed to assess the capabilities of Multimodal Large Language Models (MLLMs) in the web domain, covering a variety of tasks such as captioning, webpage QA, OCR, grounding, and reasoning.
Idea-2-3D is a novel framework that leverages Large Multimodal Models (LMMs) and existing algorithmic tools to automatically generate 3D models from complex multimodal inputs (IDEAs) containing text, images, and 3D models.
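A hedged sketch of the kind of LMM-driven generate-and-critique loop such a framework implies; every object, method, and class below is a placeholder for illustration, not Idea-2-3D's actual API:

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    satisfied: bool
    notes: str = ""

def idea_to_3d(idea, lmm, t2i_model, image_to_3d, max_rounds=5):
    """Iteratively refine a 3D model from a multimodal IDEA; all
    dependencies are injected placeholders (hypothetical interfaces)."""
    prompt = lmm.draft_prompt(idea)          # LMM turns the IDEA into a T2I prompt
    best_model = None
    for _ in range(max_rounds):
        images = t2i_model.generate(prompt)  # candidate reference images
        drafts = [image_to_3d(img) for img in images]
        best_model, feedback = lmm.select_and_critique(idea, drafts)
        if feedback.satisfied:               # LMM judges fidelity to the IDEA
            break
        prompt = lmm.revise_prompt(prompt, feedback)
    return best_model
```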
Multimodal foundation models exhibit a consistent preference for textual over visual representations of the same problems, in contrast to known human preferences.
Chat-UniVi introduces a unified vision-language model that empowers large language models to comprehend and engage in conversations involving both images and videos through a unified representation of dynamic visual tokens, outperforming existing methods.
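To illustrate the general idea of compressing visual input into fewer tokens, here is a simplified greedy-merging sketch; it is not Chat-UniVi's actual clustering procedure, and all shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Greedily average the two most similar visual tokens until only
    `keep` remain -- a toy stand-in for dynamic visual token merging."""
    tokens = tokens.clone()
    while tokens.shape[0] > keep:
        normed = F.normalize(tokens, dim=-1)
        sim = normed @ normed.T               # pairwise cosine similarity
        sim.fill_diagonal_(-1.0)              # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2  # fuse the closest pair
        mask = torch.ones(tokens.shape[0], dtype=torch.bool)
        mask[i] = mask[j] = False
        tokens = torch.cat([tokens[mask], merged.unsqueeze(0)], dim=0)
    return tokens

# e.g. 64 patch tokens from one frame compressed to 16 dynamic tokens
compressed = merge_similar_tokens(torch.randn(64, 1024), keep=16)
print(compressed.shape)  # torch.Size([16, 1024])
```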
Uni-AD proposes a unified framework for Audio Description (AD) generation, leveraging multimodal inputs and contextual information to enhance performance.
MiniGPT-5 introduces generative vokens, an innovative approach for improved interleaved multimodal generation.
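As a rough illustration of the generative-voken idea (a minimal sketch with hypothetical dimensions and names, not MiniGPT-5's implementation), the LLM's hidden states at dedicated voken positions can be mapped into a conditioning space for an image generator:

```python
import torch
import torch.nn as nn

class VokenHead(nn.Module):
    """Maps LLM hidden states at voken positions into a conditioning
    vector for an image generator (hypothetical dimensions and names)."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768, num_vokens: int = 8):
        super().__init__()
        self.num_vokens = num_vokens
        self.to_cond = nn.Linear(llm_dim, cond_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, llm_dim); assume the final
        # `num_vokens` positions are the generative voken slots
        voken_states = hidden_states[:, -self.num_vokens:, :]
        return self.to_cond(voken_states)  # (batch, num_vokens, cond_dim)

# Conditioning vectors that could guide a text-to-image diffusion decoder.
head = VokenHead()
cond = head(torch.randn(1, 32, 4096))
print(cond.shape)  # torch.Size([1, 8, 768])
```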
Griffon v2 introduces a high-resolution multimodal model with visual-language co-referring capabilities, achieving state-of-the-art performance in object detection, counting, referring expression comprehension (REC), phrase grounding, and referring expression generation (REG) tasks.