Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
The authors propose SemMIM, a semantics-enhanced masked image modeling (MIM) framework for vision-language pre-training. It improves cross-modal semantic alignment by injecting high-level semantics into local patch encodings and by involving the text modality deeply in the MIM process, so that masked-patch reconstruction is conditioned on, rather than independent of, the paired text.
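To make the idea concrete, below is a minimal NumPy sketch of text-conditioned masked image modeling. Everything here is an illustrative assumption, not the paper's actual architecture: the zero `[MASK]` token, the mean-of-visible-patches context, and the additive text conditioning stand in for the learned encoders and cross-attention a real SemMIM-style model would use.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, d = 16, 32
mask_ratio = 0.5

# Hypothetical inputs: vision patch embeddings and a pooled text embedding.
patches = rng.normal(size=(num_patches, d))
text = rng.normal(size=(d,))

# Randomly mask a subset of patches, as in masked image modeling.
num_masked = int(mask_ratio * num_patches)
masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
visible = patches.copy()
visible[masked_idx] = 0.0  # zero vector as a stand-in [MASK] token

# Toy cross-modal predictor: each masked patch is reconstructed from the
# mean of the remaining patches plus the text embedding (a stand-in for
# the cross-attention that would involve text in the reconstruction).
context = visible.mean(axis=0) + text
pred = np.tile(context, (num_masked, 1))

# The target here is the raw patch embedding; SemMIM's point is to use
# semantics-enhanced targets instead of low-level pixel/patch values.
loss = float(np.mean((pred - patches[masked_idx]) ** 2))
```

The reconstruction loss only depends on the masked positions, and because `text` enters the context, gradients in a trained version of this sketch would flow through the text pathway, which is the "deep involvement" of text the summary refers to.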