Multimodal Large Language Models (MLLMs) have emerged as a promising approach to achieving artificial general intelligence by leveraging the power of large language models and multimodal reasoning. This survey provides a comprehensive overview of the recent progress in MLLMs, including key techniques such as Multimodal Instruction Tuning, Multimodal In-Context Learning, Multimodal Chain of Thought, and LLM-Aided Visual Reasoning.
AVG-LLaVA enhances the efficiency and performance of multimodal large language models (MLLMs) by adaptively selecting the appropriate visual granularity for image processing based on the input image and instruction, thereby reducing the number of visual tokens required and speeding up inference without compromising accuracy.
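A minimal sketch of the idea behind adaptive granularity selection, assuming a learned router that scores pooled versions of the visual feature grid against an instruction embedding; the class and parameter names are illustrative, not AVG-LLaVA's actual modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGranularityRouter(nn.Module):
    """Toy router: pools visual tokens to several granularities and picks
    one per sample from the instruction embedding (illustrative only)."""

    def __init__(self, dim: int, grid: int = 24, scales=(24, 12, 6, 3)):
        super().__init__()
        self.grid, self.scales = grid, scales
        self.router = nn.Linear(dim, len(scales))  # scores each granularity

    def forward(self, vis_tokens, instr_emb):
        # vis_tokens: (B, grid*grid, D); instr_emb: (B, D)
        B, N, D = vis_tokens.shape
        fmap = vis_tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        candidates = [
            F.adaptive_avg_pool2d(fmap, s).flatten(2).transpose(1, 2)
            for s in self.scales
        ]  # each: (B, s*s, D)
        choice = self.router(instr_emb).argmax(dim=-1)  # (B,)
        return [candidates[c][i] for i, c in enumerate(choice.tolist())]

router = AdaptiveGranularityRouter(dim=64)
vis, instr = torch.randn(2, 24 * 24, 64), torch.randn(2, 64)
print([t.shape for t in router(vis, instr)])  # coarser scale -> fewer visual tokens
```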
LongLLaVA improves the long-context understanding and efficiency of multimodal LLMs through a hybrid architecture that combines Mamba and Transformer blocks.
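For intuition, here is a hedged sketch of such a hybrid stack, using a standard attention block and a gated-convolution stand-in for the Mamba block, interleaved at a 7:1 ratio in the spirit of Jamba-style hybrids (the paper's exact block design and layout may differ):

```python
import torch
import torch.nn as nn

class AttnBlock(nn.Module):
    """Standard pre-norm self-attention block (quadratic in length)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h)[0]

class MambaLikeBlock(nn.Module):
    """Stand-in for a Mamba block: a gated causal 1-D convolution that
    mixes the sequence in linear time (not the real selective SSM)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size=4, padding=3)
    def forward(self, x):
        h = self.conv(self.norm(x).transpose(1, 2))[..., : x.size(1)]
        a, b = h.chunk(2, dim=1)  # gated update
        return x + (a * torch.sigmoid(b)).transpose(1, 2)

def hybrid_stack(dim, groups=2):
    layers = []
    for _ in range(groups):  # 7 linear-time blocks per attention block
        layers += [MambaLikeBlock(dim) for _ in range(7)] + [AttnBlock(dim)]
    return nn.Sequential(*layers)

x = torch.randn(1, 1024, 64)  # long multimodal token sequence
print(hybrid_stack(64)(x).shape)
```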
This paper introduces ERRORRADAR, a novel benchmark designed to evaluate the complex mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs) by assessing their proficiency in detecting and categorizing errors in student-provided solutions to mathematical problems.
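The benchmark's two subtasks (locating the erroneous step and naming the error type) suggest an evaluation loop like the following sketch; the field names and category labels here are assumptions, not ERRORRADAR's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    problem_image: str        # path to the problem/solution image
    student_steps: list[str]  # the student's written solution steps
    error_step: int           # index of the first erroneous step
    error_category: str       # e.g. "calculation" (hypothetical label)

def score(preds: list[tuple[int, str]], data: list[Sample]) -> dict:
    """Accuracy on error-step detection and error categorization."""
    step_hits = sum(p[0] == s.error_step for p, s in zip(preds, data))
    cat_hits = sum(p[1] == s.error_category for p, s in zip(preds, data))
    n = len(data)
    return {"error_step_acc": step_hits / n, "error_category_acc": cat_hits / n}

data = [Sample("q1.png", ["expand", "simplify"], 1, "calculation")]
print(score([(1, "calculation")], data))
```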
Frozen large language models (LLMs) can effectively generalize to multimodal inputs due to an implicit multimodal alignment (IMA) driven by their architectural design, specifically the interplay between residual streams and refinement blocks, enabling them to process diverse data types like images, videos, and audio alongside text.
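A toy illustration of the residual-stream view, assuming frozen blocks that add small refinements to a shared token stream into which projected non-text embeddings are injected (purely conceptual, not the paper's experimental setup):

```python
import torch
import torch.nn as nn

dim = 64
# Frozen "refinement blocks" standing in for a frozen LLM's layers.
blocks = nn.ModuleList(
    [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim)) for _ in range(4)]
)
for p in blocks.parameters():
    p.requires_grad_(False)

text = torch.randn(1, 10, dim)            # text token embeddings
image = torch.randn(1, 5, dim)            # projected visual embeddings
stream = torch.cat([image, text], dim=1)  # one shared residual stream

for block in blocks:
    stream = stream + block(stream)  # each block refines all tokens alike
print(stream.shape)  # (1, 15, 64): image tokens rode the same pathway
```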
This paper presents a tutorial proposal for the ACM Multimedia 2024 conference on recent advancements in multimodal pretrained and large models, particularly their ability to integrate and process diverse data forms such as text, images, audio, and video.
The selection of connectors in Multimodal Large Language Models (MLLMs) significantly impacts performance, with feature-preserving connectors excelling in fine-grained perception tasks and feature-compressing connectors offering speed advantages in coarse-grained perception and reasoning tasks.
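The trade-off is easy to see in code: a sketch contrasting a feature-preserving per-token MLP projector (LLaVA-style) with a feature-compressing connector that pools the patch grid before projection (pooling stands in here for the various compression schemes, e.g. resamplers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim_v, dim_t = 1024, 4096
vis = torch.randn(1, 576, dim_v)  # e.g. 24x24 grid of ViT patch features

# Feature-preserving: project every patch token; all 576 tokens reach the
# LLM, retaining fine-grained detail at higher inference cost.
project = nn.Sequential(nn.Linear(dim_v, dim_t), nn.GELU(), nn.Linear(dim_t, dim_t))
print(project(vis).shape)  # (1, 576, 4096)

# Feature-compressing: pool the grid first, then project; far fewer tokens,
# faster inference, coarser spatial detail.
fmap = vis.transpose(1, 2).reshape(1, dim_v, 24, 24)
pooled = F.adaptive_avg_pool2d(fmap, 8).flatten(2).transpose(1, 2)
print(project(pooled).shape)  # (1, 64, 4096)
```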
Existing Multimodal Large Language Models (MLLMs) struggle with nuanced human-centric understanding due to limitations in training data, and specialized benchmarks and datasets like HERM are crucial for driving progress in this area.
Existing Multimodal Large Language Models (MLLMs) are trained on data that lacks the detailed, multifaceted human-centric annotations needed for human-centric visual understanding, limiting their performance in complex human-centric scenarios; to address this, the authors propose HERM-100K, a dataset with multi-level human-centric annotations, and the accompanying HERM-Bench benchmark.