Core Concepts
Efficiently reduce redundancy in visual instruction datasets for improved performance in multimodal large language models.
Abstract
Visual instruction tuning is crucial for building multimodal large language models (MLLMs), but existing models are trained on visual instruction datasets that pool many tasks, which introduces substantial data redundancy. TIVE addresses this by estimating the value of both tasks and individual instances, then selecting a small representative subset for fine-tuning. With only about 7.5% of the data, it matches the performance of full-data fine-tuning and even surpasses it on several benchmarks.
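The selection idea described above, scoring tasks and instances and keeping the most valuable examples per task, can be sketched roughly as follows. This is a minimal illustration, not TIVE's actual method: the value scores are assumed to be given, and the proportional budgeting rule is a simplifying assumption.

```python
def select_subset(dataset, task_value, instance_value, ratio=0.075):
    """Pick a small representative subset of a multi-task dataset.

    dataset: list of (task, instance) pairs
    task_value: dict mapping task -> nonnegative value score (assumed given)
    instance_value: function instance -> value score (assumed given)
    ratio: fraction of the full data to keep (~7.5% as in the paper)
    """
    budget = max(1, int(len(dataset) * ratio))
    total = sum(task_value.values())

    # Group instances by task.
    by_task = {}
    for task, inst in dataset:
        by_task.setdefault(task, []).append(inst)

    selected = []
    for task, instances in by_task.items():
        # Allocate each task a share of the budget proportional to its value.
        k = max(1, round(budget * task_value[task] / total))
        # Within a task, keep the highest-valued instances.
        instances.sort(key=instance_value, reverse=True)
        selected.extend((task, inst) for inst in instances[:k])
    return selected
```

For example, with two tasks where one is valued twice as highly as the other, the higher-valued task receives roughly two thirds of the selection budget, and within each task only the top-scored instances survive.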
Stats
"Using only about 7.5% data can achieve comparable performance as the full-data fine-tuned model."
"Our approach even surpasses the full-data model on four benchmarks."
"With only 16.7% of the data from Vision-Flan dataset, our approach achieves 95% performance compared to the original model."