Visual Pretraining with Location-aware Captioners
LocCa is a simple yet effective visual pretraining method that incorporates location-aware information into pretraining, improving the model's understanding of fine-grained visual details while maintaining competitive performance on holistic image understanding tasks.