Core Concepts
The authors introduce TRIPS, an efficient VLP approach that shortens the visual sequence through text-guided patch selection, accelerating both training and inference without adding any parameters.
Abstract
The paper "Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection" introduces TRIPS, a method that reduces the computational cost of VLP models. By dynamically selecting image patches under text guidance, TRIPS improves efficiency without compromising performance on a range of downstream tasks.
Long visual sequences are a major source of computation in VLP models. TRIPS addresses this by selecting text-relevant image tokens inside the vision backbone and progressively reducing redundant ones, which accelerates both training and inference while maintaining competitive or superior performance on multiple benchmarks.
TRIPS adds no extra parameters; the efficiency gain comes purely from pruning redundant image tokens through text-aware patch selection. The reported results show a 40% speedup while achieving comparable or better scores on VQA, NLVR2, cross-modal retrieval, image captioning, and visual grounding tasks.
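The selection step can be pictured as follows: the text representation acts as a query, each image patch token receives a relevance score, the most attentive patches are kept, and the rest are fused into a single token so the sequence shrinks without discarding information outright. Below is a minimal PyTorch-style sketch of this idea, not the authors' implementation; the function name `select_patches`, the use of the text [CLS] embedding as the query, and the keep ratio are illustrative assumptions.

```python
import torch

def select_patches(image_tokens, text_cls, keep_ratio=0.7):
    """Text-aware patch selection (illustrative sketch, not the paper's code).

    image_tokens: (B, N, D) image patch embeddings
    text_cls:     (B, D)    text [CLS] embedding used as the query
    Returns roughly keep_ratio * N attentive tokens plus one fused token
    summarizing the inattentive ones.
    """
    B, N, D = image_tokens.shape
    k = max(1, int(N * keep_ratio))

    # Relevance of each patch to the text: scaled dot-product scores.
    scores = torch.einsum("bnd,bd->bn", image_tokens, text_cls) / D ** 0.5
    attn = scores.softmax(dim=-1)                       # (B, N)

    # Keep the k most text-relevant patches.
    topk = attn.topk(k, dim=-1).indices                 # (B, k)
    keep = torch.gather(
        image_tokens, 1, topk.unsqueeze(-1).expand(-1, -1, D)
    )                                                   # (B, k, D)

    # Fuse the remaining (inattentive) patches into a single token,
    # weighted by their attention, instead of dropping them outright.
    mask = torch.ones_like(attn, dtype=torch.bool).scatter(1, topk, False)
    rest_w = (attn * mask).unsqueeze(-1)                # (B, N, 1)
    fused = (image_tokens * rest_w).sum(1, keepdim=True) / (
        rest_w.sum(1, keepdim=True) + 1e-6
    )                                                   # (B, 1, D)

    return torch.cat([keep, fused], dim=1)              # (B, k + 1, D)
```

Applying such a layer at several depths of the vision backbone shortens the visual sequence progressively, which is consistent with the paper's claim that TRIPS "progressively reduces redundant image tokens."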
By plugging TRIPS into existing VLP models such as ALBEF and mPLUG, the study shows that substantial efficiency gains are possible without sacrificing task performance. Because patch selection offsets the cost of longer visual sequences, the saved computation can also be spent on fine-tuning with higher-resolution images to further improve accuracy.
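Reusing the `select_patches` sketch above, a quick usage example shows why higher-resolution fine-tuning stays affordable: a higher-resolution image yields more patch tokens, and selection cuts the sequence back down before the expensive cross-modal layers. The resolution, patch size, and keep ratio here are illustrative assumptions, not figures from the paper.

```python
# Usage of the select_patches sketch above (illustrative numbers only).
B, D = 2, 768
image_tokens = torch.randn(B, 576, D)  # 384x384 image, 16x16 patches -> 576 tokens
text_cls = torch.randn(B, D)

out = select_patches(image_tokens, text_cls, keep_ratio=0.5)
print(out.shape)  # torch.Size([2, 289, 768]): 288 kept patches + 1 fused token
```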
Stats
TRIPS delivers a 40% speedup.
ViLT equipped with TRIPS scores 71.48 on VQA test-dev.
ALBEF with TRIPS achieves 76.23 on VQA test-dev.
mPLUG with TRIPS reaches 80.11 accuracy on RefCOCO+ testA.
Quotes
"No additional parameters are introduced."
"TRIPS progressively reduces redundant image tokens."
"TRIPS maintains competitive or superior performance."