Core Concepts
MLLMs enhance visual-language representation learning by generating diverse, extended captions for each image while mitigating hallucinations and monotonous language styles.
Summary
The success of visual-language pre-training relies on large-scale image-text datasets, but noisy pairs hinder representation learning. MLLMs are used to rewrite captions and enrich image-text associations without adding training cost. Text shearing is applied to maintain the quality of the rewritten captions. Experiments show significant performance improvements in both zero-shot and fine-tuning settings. Rewriting with a single model is more limited than combining multiple MLLMs. Training factors such as batch size, number of epochs, and caption length also affect performance.
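To make the caption-rewriting idea concrete, below is a minimal Python sketch of how diverse MLLM rewrites and text shearing could be combined during training. It assumes text shearing means truncating each rewritten caption to roughly the original caption's length (word-level here for simplicity); the function names `shear_caption` and `sample_training_caption` are illustrative and not the authors' implementation.

```python
import random

def shear_caption(rewritten: str, original: str) -> str:
    """Truncate an MLLM-rewritten caption so it is no longer than the original
    caption (counted in whitespace tokens), limiting hallucinated tails and
    keeping the length distribution close to the raw dataset.
    NOTE: word-level truncation is an assumption; a tokenizer could be used instead."""
    max_len = len(original.split())
    return " ".join(rewritten.split()[:max_len])

def sample_training_caption(original: str, mllm_rewrites: list[str]) -> str:
    """Pick one caption per training step: either the original or one of the
    sheared MLLM rewrites, so the model sees diverse but length-controlled text."""
    candidates = [original] + [shear_caption(r, original) for r in mllm_rewrites]
    return random.choice(candidates)

# Toy usage with hypothetical captions
original = "a dog runs on the beach"
rewrites = [
    "a brown dog sprints joyfully across a sunny sandy beach near the waves",
    "an energetic dog playing by the seashore under a clear blue sky",
]
print(sample_training_caption(original, rewrites))
```

Sampling among the original and several rewrites preserves the number of training pairs while exposing the model to varied language styles, which is the stated motivation for using multiple MLLMs rather than a single rewriter.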
Statistics
Our method consistently obtains 5.6 ∼ 35.0% and 16.8 ∼ 46.1% improvement on Recall@1 under the fine-tuning and zero-shot settings, respectively.
The average improvement of our method is 13.4 on 15 datasets and 13.1 on ImageNet.
For zero-shot retrieval on MSCOCO, the R@1 of TR and IR increase by 27.2 and 19.4, respectively.
Our method achieves an improvement of 16.8∼23.4 when using BLIP for zero-shot retrieval.
Our method demonstrates an average improvement of 7.9 on linear probing across common datasets.
Quotations
"We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning by establishing richer image-text associations for image-text datasets."
"Our approach exhibits the following characteristics: It is compatible with multiple visual-language pre-training frameworks like CLIP and BLIP, demonstrating significant performance improvements across various downstream tasks without introducing additional training overhead."
"Most recent works demonstrate that LLM and MLLM can be used as re-writers to improve caption quality without reducing the number of training pairs."