Lan, Z., Niu, L., Meng, F., Li, W., Zhou, J., & Su, J. (2024). AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity. arXiv preprint arXiv:2410.02745.
This research paper introduces AVG-LLaVA, a multimodal large language model (MLLM) designed to address the computational cost of high-resolution images in existing MLLMs, which split such images into sub-images and thereby produce large numbers of visual tokens. The study aims to improve the efficiency and performance of MLLMs by adaptively selecting an appropriate visual granularity based on the input image and instruction.
The researchers developed AVG-LLaVA by extending the LLaVA-NeXT architecture with two key modules: a visual granularity scaler and a visual granularity router. The visual granularity scaler generates multi-granularity visual features by applying multiple rounds of pooling on visual tokens. The visual granularity router, comprising a Transformer layer, an MLP layer, and a voter layer, then selects the most appropriate visual granularity based on the input image and instruction. The model was trained using a multi-stage training strategy, including pretraining on image-caption pairs, visual instruction tuning, multi-granularity visual instruction tuning, and a novel Ranking Granularity to Align LMM Feedback (RGLF) paradigm.
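The two modules are straightforward to picture in code. Below is a minimal PyTorch sketch of how such a scaler and router could be wired together; all module names, dimensions, the pooling choice, and the voting scheme are assumptions for illustration, not the paper's official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualGranularityScaler(nn.Module):
    """Builds multi-granularity features by repeatedly pooling visual tokens.

    Assumes the tokens form a square grid (e.g. 24x24 patches, as in
    LLaVA-NeXT-style encoders) and that each round halves the grid side
    with 2x2 average pooling.
    """

    def __init__(self, num_granularities: int = 4):
        super().__init__()
        self.num_granularities = num_granularities

    def forward(self, visual_tokens: torch.Tensor) -> list:
        # visual_tokens: (batch, num_tokens, dim), num_tokens a square number
        b, n, d = visual_tokens.shape
        side = int(n ** 0.5)
        grid = visual_tokens.transpose(1, 2).reshape(b, d, side, side)
        features = [visual_tokens]  # finest granularity, kept unpooled
        for _ in range(self.num_granularities - 1):
            grid = F.avg_pool2d(grid, kernel_size=2)  # one pooling round
            features.append(grid.flatten(2).transpose(1, 2))
        return features  # coarser granularities carry fewer tokens


class VisualGranularityRouter(nn.Module):
    """Scores granularities from visual and instruction features.

    A single Transformer layer fuses the tokens, an MLP produces per-token
    votes over granularities, and the votes are averaged (our reading of
    the 'voter layer') into one probability per granularity.
    """

    def __init__(self, dim: int, num_granularities: int = 4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, num_granularities))

    def forward(self, visual_feats: torch.Tensor,
                instruction_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate visual tokens with instruction embeddings along the
        # sequence axis; both are (batch, seq, dim).
        tokens = torch.cat([visual_feats, instruction_feats], dim=1)
        tokens = self.encoder(tokens)
        votes = self.mlp(tokens)              # (batch, seq, granularities)
        return votes.mean(dim=1).softmax(-1)  # probability per granularity
```

With 2×2 average pooling and a 24×24 starting grid, four granularities would cover token counts of 576, 144, 36, and 9, which illustrates where the token savings come from when a coarse granularity suffices.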
AVG-LLaVA demonstrated superior performance compared to other state-of-the-art MLLMs on 11 benchmarks spanning general VQA, text-oriented VQA, and general multimodal tasks. It also substantially reduced the number of visual tokens and improved inference speed, particularly on tasks that do not require fine-grained visual information. For instance, on the AI2D benchmark, AVG-LLaVA reduced visual tokens by 85.3% and increased inference speed by 2.53× compared to LLaVA-NeXT.
The study highlights the importance of adaptive visual granularity in enhancing the efficiency and performance of MLLMs. By dynamically adjusting the level of visual detail processed according to the task and input, AVG-LLaVA reduces computational cost without sacrificing accuracy. The proposed RGLF training paradigm proves effective at guiding the router to select the most appropriate visual granularity based on LMM feedback.
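To make the RGLF idea concrete, here is a hedged sketch of a pairwise ranking loss: granularities are ranked by an LMM feedback score (for example, the log-likelihood the LMM assigns to the reference answer when fed each granularity), and the router is penalized whenever it scores a better-ranked granularity below a worse-ranked one. The margin formulation and the choice of feedback signal are assumptions; the paper's exact loss may differ.

```python
import torch


def rglf_loss(router_logits: torch.Tensor,
              lmm_feedback: torch.Tensor,
              margin: float = 0.1) -> torch.Tensor:
    """Pairwise margin ranking loss (an assumed form of RGLF).

    router_logits, lmm_feedback: (batch, num_granularities). For every pair
    where granularity i received better LMM feedback than granularity j,
    the router is pushed to score i at least `margin` above j.
    """
    fi = lmm_feedback.unsqueeze(2)   # (b, g, 1)
    fj = lmm_feedback.unsqueeze(1)   # (b, 1, g)
    better = (fi > fj).float()       # 1 where i outranks j per the feedback
    si = router_logits.unsqueeze(2)
    sj = router_logits.unsqueeze(1)
    pair_loss = torch.relu(margin - (si - sj)) * better
    return pair_loss.sum() / better.sum().clamp(min=1)
```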
This research significantly contributes to the field of MLLMs by addressing the computational bottlenecks associated with high-resolution image processing. The proposed AVG-LLaVA model and RGLF training paradigm offer a promising avenue for developing more efficient and scalable MLLMs capable of handling increasingly complex multimodal tasks.
While AVG-LLaVA shows promising results, the authors acknowledge limitations in handling text-intensive benchmarks where the model tends to select the finest-grained visual tokens. Future research could explore more sophisticated granularity scaling networks to provide a wider range of visual granularities and further optimize performance on such tasks.