RACC, a framework that learns to compress and aggregate retrieved contexts, achieves state-of-the-art performance on knowledge-based visual question answering tasks while significantly reducing inference latency and storage requirements.
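To make the compress-and-aggregate idea concrete, here is a minimal PyTorch sketch (not RACC's actual code) in which learned query tokens pool each retrieved passage into a handful of summary tokens, so the answerer attends to a short prompt instead of full documents; all module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    def __init__(self, dim=256, n_queries=4, n_heads=4):
        super().__init__()
        # Learned query tokens pull a fixed-size summary out of each passage.
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, passage_tokens):  # (batch, seq_len, dim)
        q = self.queries.unsqueeze(0).expand(passage_tokens.size(0), -1, -1)
        summary, _ = self.attn(q, passage_tokens, passage_tokens)
        return summary  # (batch, n_queries, dim)

compressor = ContextCompressor()
# Three retrieved passages of 128 tokens each -> 3 * 4 = 12 summary tokens.
passages = torch.randn(3, 128, 256)
compressed = compressor(passages).reshape(1, -1, 256)
print(compressed.shape)  # torch.Size([1, 12, 256])
```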
LIME-M is a streamlined benchmark that evaluates Multimodal Large Language Models (MLLMs) more efficiently by filtering out low-quality and overly easy samples and focusing on challenging tasks that require deeper image understanding and reasoning.
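As a rough illustration of the filtering idea, the sketch below drops samples that most candidate models already answer correctly (too easy) or that fail a quality check; the threshold and field names are assumptions, not LIME-M's actual pipeline.

```python
def filter_samples(samples, accuracy_by_model, easy_threshold=0.9):
    kept = []
    for s in samples:
        per_model = [accuracy_by_model[m][s["id"]] for m in accuracy_by_model]
        mean_acc = sum(per_model) / len(per_model)
        if s["quality_ok"] and mean_acc < easy_threshold:
            kept.append(s)  # challenging and well-formed: keep it
    return kept

samples = [
    {"id": 0, "quality_ok": True},   # easy: every model gets it right
    {"id": 1, "quality_ok": True},   # hard: kept
    {"id": 2, "quality_ok": False},  # low-quality: dropped
]
acc = {"model_a": {0: 1.0, 1: 0.0, 2: 1.0},
       "model_b": {0: 1.0, 1: 0.5, 2: 0.0}}
print([s["id"] for s in filter_samples(samples, acc)])  # [1]
```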
VITA is an open-source multimodal large language model that can simultaneously process and analyze video, image, text, and audio modalities, while also featuring advanced multimodal human-computer interaction capabilities.
This paper introduces a novel training-free framework called Recaption, Plan and Generate (RPG) that leverages the powerful reasoning abilities of multimodal large language models (MLLMs) to enhance the compositionality and controllability of text-to-image diffusion models.
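The three stages can be pictured as the toy pipeline below, where the MLLM and diffusion calls are replaced by stand-in functions; only the recaption -> plan -> generate structure is taken from the summary above, and every function name is hypothetical.

```python
def mllm_recaption(prompt):
    # Stage 1: split a compositional prompt into detailed sub-prompts.
    return [p.strip() for p in prompt.split(" and ")]

def mllm_plan(subprompts):
    # Stage 2: assign each sub-prompt a canvas region (here: equal columns).
    width = 1.0 / len(subprompts)
    return [{"prompt": sp, "region": (i * width, (i + 1) * width)}
            for i, sp in enumerate(subprompts)]

def diffusion_render(plan):
    # Stage 3: generate each region with its own sub-prompt, then composite.
    # A real implementation would run region-wise diffusion here.
    return [f"render '{p['prompt']}' in x-range {p['region']}" for p in plan]

layout = mllm_plan(mllm_recaption("a red fox and a blue lake"))
for step in diffusion_render(layout):
    print(step)
```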
Morph-tokens, which transform pre-MLLM visual tokens into non-conflicting post-MLLM visual tokens, enable multimodal large language models to achieve synergy between visual comprehension and generation tasks.
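A toy way to see the idea: let the tokens the MLLM reads (abstract, comprehension-friendly) differ from the tokens it emits for image generation, so the two objectives stop competing over one representation. The PyTorch sketch below is purely illustrative; layer names and sizes are invented.

```python
import torch
import torch.nn as nn

dim = 64
encode_for_reading = nn.Linear(dim, dim)   # pre-MLLM: visual -> abstract tokens
mllm_backbone = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
decode_for_drawing = nn.Linear(dim, dim)   # post-MLLM: tokens -> generative features

visual_feats = torch.randn(1, 16, dim)     # 16 patch features from an image
pre_tokens = encode_for_reading(visual_feats)
hidden = mllm_backbone(pre_tokens)
post_tokens = decode_for_drawing(hidden)   # fed to an image decoder (not shown)
print(pre_tokens.shape, post_tokens.shape)
```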
This thesis aims to advance the theoretical and computational foundations of multimodal machine learning, enabling the creation of next-generation multimodal technologies that can learn from and reason about multiple sensory inputs.
This framework automatically generates sound effects and background music that are semantically consistent with a given video, using a multimodal language model to understand the video content and guide the audio generation.
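Structurally, the flow reduces to one video-understanding call whose output conditions the audio generators, as in the stub sketch below; both functions are hypothetical placeholders for the actual model calls.

```python
def mllm_describe(video_path):
    # Stand-in for a video-understanding MLLM call.
    return "waves crashing on a beach at sunset"

def generate_audio(kind, description):
    # Stand-in for a text-conditioned audio generator.
    return f"{kind} track conditioned on: '{description}'"

description = mllm_describe("beach_clip.mp4")
print(generate_audio("sound-effect", description))
print(generate_audio("background-music", description))
```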
Cantor, a novel multimodal chain-of-thought framework, effectively integrates visual context and logical reasoning to solve complex visual reasoning tasks by leveraging the advanced cognitive capabilities of multimodal large language models.
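One way to picture such a multimodal chain-of-thought is a two-stage prompt that first elicits visual facts and then reasons over them; `call_mllm` below is a placeholder for any chat-style MLLM API, not Cantor's implementation.

```python
def call_mllm(prompt, image=None):
    # Stub standing in for a real MLLM API call.
    return f"<model answer for: {prompt[:40]}...>"

def multimodal_cot(image, question):
    # Stage 1 (decision): gather the visual evidence the question needs.
    facts = call_mllm(
        "List the visual facts needed to answer: " + question, image=image)
    # Stage 2 (execution): reason step by step over those facts only.
    return call_mllm(
        f"Facts: {facts}\nQuestion: {question}\nThink step by step, then answer.")

print(multimodal_cot("chart.png", "Which bar grew fastest from 2020 to 2023?"))
```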
DesignProbe is a comprehensive benchmark that evaluates the capabilities of Multimodal Large Language Models (MLLMs) in understanding graphic design, covering both fine-grained design elements and overall design concepts.
MoVA, a powerful multimodal large language model, adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism to enhance generalization across diverse image content.
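Below is a toy version of coarse-to-fine expert routing: a router first scores and picks the top-k vision experts for the input (coarse), then fuses their outputs with softmax weights (fine). Random linear layers stand in for pretrained vision experts here, and every name is illustrative.

```python
import torch
import torch.nn as nn

dim, n_experts, top_k = 32, 4, 2
experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
router = nn.Linear(dim, n_experts)

x = torch.randn(1, dim)                    # pooled image feature
scores = router(x)                         # coarse: score every expert
weights, idx = scores.topk(top_k, dim=-1)  # keep only the top-k experts
weights = weights.softmax(dim=-1)
fused = sum(w * experts[int(i)](x)         # fine: weighted feature fusion
            for w, i in zip(weights[0], idx[0]))
print(fused.shape)  # torch.Size([1, 32])
```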