Heron-Bench is a novel benchmark for assessing the Japanese-language capabilities of Vision Language Models (VLMs). It consists of a diverse set of image-question-answer pairs tailored to the Japanese context, enabling a comprehensive and culturally aware evaluation of VLMs.
The OmniFusion model integrates a pretrained large language model with specialized adapters for processing visual information, enabling superior performance on a range of visual-language benchmarks compared to existing open-source solutions.
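As a rough illustration of how such adapter modules typically work (a generic sketch, not OmniFusion's actual architecture; all dimensions and names here are assumptions), a small MLP can project frozen vision-encoder features into the LLM's token-embedding space:

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Two-layer MLP projecting vision-encoder features into the LLM's
    token-embedding space (all dimensions are illustrative assumptions)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen
        # vision encoder; the output acts as pseudo-token embeddings that
        # the language model consumes alongside ordinary text tokens
        return self.proj(image_features)

# Example: 256 patch features become 256 pseudo-token embeddings.
adapter = VisualAdapter()
visual_tokens = adapter(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```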
VisualWebBench is a comprehensive multimodal benchmark designed to assess the capabilities of Multimodal Large Language Models (MLLMs) in the web domain, covering a variety of tasks such as captioning, webpage QA, OCR, grounding, and reasoning.
Idea-2-3D is a novel framework that leverages Large Multimodal Models (LMMs) and existing algorithmic tools to automatically generate 3D models from complex multimodal inputs (IDEAs) containing text, images, and 3D models.
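A hedged sketch of the kind of LMM-driven generate-and-critique loop such a framework implies; every object, method, and class below is a placeholder for illustration, not Idea-2-3D's actual API:

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    satisfied: bool
    notes: str = ""

def idea_to_3d(idea, lmm, t2i_model, image_to_3d, max_rounds=5):
    """Iteratively refine a 3D model from a multimodal IDEA; all
    dependencies are injected placeholders (hypothetical interfaces)."""
    prompt = lmm.draft_prompt(idea)          # LMM turns the IDEA into a T2I prompt
    best_model = None
    for _ in range(max_rounds):
        images = t2i_model.generate(prompt)  # candidate reference images
        drafts = [image_to_3d(img) for img in images]
        best_model, feedback = lmm.select_and_critique(idea, drafts)
        if feedback.satisfied:               # LMM judges fidelity to the IDEA
            break
        prompt = lmm.revise_prompt(prompt, feedback)
    return best_model
```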
Multimodal foundation models exhibit a consistent preference for textual over visual representations of the same problems, in contrast to known human preferences.
Chat-UniVi introduces a unified vision-language model that empowers large language models to comprehend and engage in conversations involving both images and videos through a unified representation of dynamic visual tokens, outperforming existing methods.
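To illustrate the general idea of compressing visual input into fewer tokens, here is a simplified greedy-merging sketch; it is not Chat-UniVi's actual clustering procedure, and all shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Greedily average the two most similar visual tokens until only
    `keep` remain -- a toy stand-in for dynamic visual token merging."""
    tokens = tokens.clone()
    while tokens.shape[0] > keep:
        normed = F.normalize(tokens, dim=-1)
        sim = normed @ normed.T               # pairwise cosine similarity
        sim.fill_diagonal_(-1.0)              # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2  # fuse the closest pair
        mask = torch.ones(tokens.shape[0], dtype=torch.bool)
        mask[i] = mask[j] = False
        tokens = torch.cat([tokens[mask], merged.unsqueeze(0)], dim=0)
    return tokens

# e.g. 64 patch tokens from one frame compressed to 16 dynamic tokens
compressed = merge_similar_tokens(torch.randn(64, 1024), keep=16)
print(compressed.shape)  # torch.Size([16, 1024])
```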
Uni-AD proposes a unified framework for Audio Description (AD) generation, leveraging multimodal inputs and contextual information to enhance performance.
MiniGPT-5 introduces generative vokens, an innovative approach for improved interleaved multimodal generation.
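As a rough illustration of the generative-voken idea (a minimal sketch with hypothetical dimensions and names, not MiniGPT-5's implementation), the LLM's hidden states at dedicated voken positions can be mapped into a conditioning space for an image generator:

```python
import torch
import torch.nn as nn

class VokenHead(nn.Module):
    """Maps LLM hidden states at voken positions into a conditioning
    vector for an image generator (hypothetical dimensions and names)."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768, num_vokens: int = 8):
        super().__init__()
        self.num_vokens = num_vokens
        self.to_cond = nn.Linear(llm_dim, cond_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, llm_dim); assume the final
        # `num_vokens` positions are the generative voken slots
        voken_states = hidden_states[:, -self.num_vokens:, :]
        return self.to_cond(voken_states)  # (batch, num_vokens, cond_dim)

# Conditioning vectors that could guide a text-to-image diffusion decoder.
head = VokenHead()
cond = head(torch.randn(1, 32, 4096))
print(cond.shape)  # torch.Size([1, 8, 768])
```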
Griffon v2 introduces a high-resolution multimodal model with visual-language co-referring capabilities, achieving state-of-the-art performance in object detection, counting, referring expression comprehension (REC), phrase grounding, and referring expression generation (REG) tasks.