Embedded Heterogeneous Attention Transformer for Generating Bilingual Image Captions
We propose the Embedded Heterogeneous Attention Transformer (EHAT), a single model that generates accurate and fluent captions in both English and Chinese from the same visual input. EHAT employs heterogeneous attention mechanisms to establish cross-domain relationships and local correspondences between the image and each language.
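The core idea of attending to shared visual features from two language branches can be illustrated with a minimal sketch. This is not the authors' EHAT implementation; it is a hypothetical NumPy example in which English and Chinese decoder states each perform cross-attention over the same image grid features, so that each language obtains its own image-grounded context:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: queries attend to keys/values.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d = 16
image_feats = rng.standard_normal((49, d))  # e.g. a 7x7 grid of visual features
en_queries = rng.standard_normal((5, d))    # hypothetical English decoder states
zh_queries = rng.standard_normal((5, d))    # hypothetical Chinese decoder states

# Both language branches attend to the SAME visual keys/values,
# yielding a language-specific, image-grounded context per branch.
en_ctx = cross_attention(en_queries, image_feats, image_feats)
zh_ctx = cross_attention(zh_queries, image_feats, image_feats)
```

A full heterogeneous attention design would additionally model interactions between the two language branches; the sketch only shows the shared image-to-language grounding step.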