Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
The authors introduce TRIPS, an efficient vision-and-language pre-training (VLP) method that shortens the visual token sequence via text-guided patch selection, speeding up both training and inference without adding any parameters.
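The core idea can be illustrated with a minimal sketch: score each image patch by its relevance to a pooled text embedding, keep the top-scoring patches, and fuse the rest into a single token so no information is dropped outright. This is an assumption-laden toy in NumPy (the function name `select_patches`, the scoring rule, and the fusion scheme are illustrative, not the paper's exact layer):

```python
import numpy as np

def select_patches(patch_tokens, text_query, keep_ratio=0.5):
    """Text-guided patch selection (illustrative sketch, not TRIPS's exact layer).

    patch_tokens: (N, D) image patch embeddings
    text_query:   (D,) pooled text embedding (e.g. a text [CLS] vector)
    Returns a shorter sequence: the top-k text-relevant patches plus one
    fused token aggregating the discarded (inattentive) patches.
    """
    d = patch_tokens.shape[1]
    # Text-conditioned relevance: scaled dot-product scores, softmax-normalized.
    scores = patch_tokens @ text_query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    k = max(1, int(len(patch_tokens) * keep_ratio))
    order = np.argsort(-weights)
    keep, drop = order[:k], order[k:]

    kept = patch_tokens[keep]
    if len(drop) > 0:
        # Fuse inattentive patches into one token, weighted by relevance.
        w = weights[drop] / weights[drop].sum()
        fused = (w[:, None] * patch_tokens[drop]).sum(axis=0, keepdims=True)
        kept = np.concatenate([kept, fused], axis=0)
    return kept

# Example: 16 patches of dim 8 reduced to 8 kept patches + 1 fused token.
rng = np.random.default_rng(0)
out = select_patches(rng.normal(size=(16, 8)), rng.normal(size=8), keep_ratio=0.5)
print(out.shape)  # (9, 8)
```

Because the selection reuses attention-style scores already computed in the encoder, the reduction adds no learned parameters; downstream layers simply attend over a shorter sequence.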