The paper proposes LocCa, a visual pretraining method that incorporates location-aware information to enhance the model's understanding of fine-grained visual details. LocCa builds on an image captioner architecture and adds two location-aware tasks during pretraining: referring expression and grounded captioning.
The key highlights of the paper are:
LocCa uses a multi-task decoder that handles the standard image captioning task alongside the two location-aware tasks, so the model learns to predict both bounding box coordinates and captions conditioned on the image input (a sketch of this formulation follows these highlights).
Experiments show that LocCa significantly outperforms standard captioners on localization downstream tasks such as referring expression comprehension and referring expression segmentation, while maintaining comparable performance on holistic tasks like image classification and captioning.
When integrated into a large vision-language model (PaLI-3), LocCa's vision encoder outperforms strong SigLIP baselines across a variety of vision-language tasks, demonstrating the effectiveness of the location-aware pretraining.
LocCa exhibits strong zero-shot detection capabilities, though the decoding strategy to output high-quality bounding boxes and labels simultaneously remains an open challenge.
Ablation studies confirm the importance of the location-aware tasks (referring expression and grounded captioning) in improving LocCa's performance on fine-grained visual understanding.
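To make the multi-task decoder idea concrete, below is a minimal sketch of how the three pretraining targets could be serialized into prefix/target sequences for a single autoregressive decoder. It assumes Pix2Seq-style coordinate quantization into discrete location tokens; the task prefixes, the 1000-bin vocabulary, and the helper names are illustrative assumptions, not the paper's exact format.

```python
# Sketch of serializing LocCa-style pretraining targets for one shared decoder.
# Assumptions (not taken from the paper): boxes are normalized to [0, 1] and
# quantized into 1000 coordinate bins, and each task is selected by a text prefix.

from typing import List, Tuple

NUM_BINS = 1000  # assumed number of discrete coordinate bins

Box = Tuple[float, float, float, float]  # normalized (ymin, xmin, ymax, xmax)


def quantize_box(box: Box) -> List[str]:
    """Map a normalized box to discrete coordinate tokens."""
    return [f"<loc_{min(int(c * NUM_BINS), NUM_BINS - 1)}>" for c in box]


def captioning_target(caption: str) -> Tuple[str, str]:
    """Standard captioning: predict the caption given only the image."""
    return "[caption]", caption


def referring_expression_target(caption: str, box: Box) -> Tuple[str, str]:
    """Referring expression: given a region caption as prefix, predict its box."""
    return f"[refer] {caption}", " ".join(quantize_box(box))


def grounded_captioning_target(caption: str, box: Box) -> Tuple[str, str]:
    """Grounded captioning: given box tokens as prefix, predict the region caption."""
    return "[ground] " + " ".join(quantize_box(box)), caption


if __name__ == "__main__":
    # Toy example: one image-level caption plus one region annotation.
    region_caption = "a dog catching a frisbee"
    region_box = (0.12, 0.30, 0.85, 0.72)

    for prefix, target in (
        captioning_target("a dog playing in a park"),
        referring_expression_target(region_caption, region_box),
        grounded_captioning_target(region_caption, region_box),
    ):
        # During pretraining, the loss would only be applied to the target tokens;
        # the prefix conditions the shared decoder on which task to perform.
        print(f"prefix: {prefix!r}\ntarget: {target!r}\n")
```

The point of routing all three tasks through one decoder is that the same weights see captions and coordinate tokens in both input and output roles, which is what gives the encoder its location-aware representations.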