Core Concepts
A single Embedded Heterogeneous Attention Transformer (EHAT) model is designed to simultaneously generate precise and fluent English and Chinese captions from visual information, by leveraging heterogeneous attention mechanisms to establish cross-domain relationships and local correspondences between images and different languages.
Summary
The article proposes the Embedded Heterogeneous Attention Transformer (EHAT) model for cross-lingual image captioning. The key highlights are:
EHAT comprises three components - Masked Heterogeneous Cross-attention (MHCA), Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA) - to establish cross-domain relationships and local correspondences between images and different languages.
MHCA aligns the dimensional space between image region features and language embeddings. HARN, the core of EHAT, aligns semantic information between each image-language pair through cross-attention and computes heterogeneous similarity weights that connect English and Chinese, anchored by the shared visual context. HCA produces the final language representations sent to the generator and facilitates cross-lingual interaction.
Two variants of HARN are explored to investigate the impact of language interactions on the heterogeneous attention structure.
Experiments on the MSCOCO dataset demonstrate that EHAT outperforms advanced monolingual image captioning methods in generating both English and Chinese captions simultaneously, effectively addressing the challenges of cross-lingual image captioning.
The proposed EHAT framework paves the way for improved multilingual image analysis and understanding.
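To make the described mechanism concrete, the sketch below illustrates the general idea of the pipeline in NumPy: a projection aligning image features with the language embedding space (in the spirit of MHCA), each language attending to the shared visual context via scaled dot-product cross-attention (in the spirit of HARN's visual anchoring), and a similarity matrix linking the two language representations. All dimensions, weights, and function names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product cross-attention: language tokens attend to image regions
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

# hypothetical sizes: 5 image regions, 16-dim visual features,
# 8-dim language embeddings, 4 tokens per caption
rng = np.random.default_rng(0)
num_regions, d_img, d_model, seq_len = 5, 16, 8, 4

# MHCA-style projection: align image feature space with language embedding space
W_proj = rng.normal(size=(d_img, d_model))
img_feats = rng.normal(size=(num_regions, d_img))
img_aligned = img_feats @ W_proj

# toy English and Chinese token embeddings
en_tokens = rng.normal(size=(seq_len, d_model))
zh_tokens = rng.normal(size=(seq_len, d_model))

# HARN-style step: each language attends to the same aligned visual context,
# so the visual features act as the anchor between the two languages
en_ctx = cross_attention(en_tokens, img_aligned, img_aligned)
zh_ctx = cross_attention(zh_tokens, img_aligned, img_aligned)

# heterogeneous similarity weights connecting English and Chinese
# through their visually grounded representations
sim = softmax(en_ctx @ zh_ctx.T / np.sqrt(d_model))
print(en_ctx.shape, zh_ctx.shape, sim.shape)
```

The key design point this sketch reflects is that neither language attends to the other directly; both are first grounded in the same visual representation, and their interaction is mediated by similarity weights over those grounded contexts.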
Statistics
Each image in the MSCOCO dataset has on average 5 pairs of English and Chinese captions.
The English vocabulary consists of 9,487 words and the Chinese vocabulary consists of 9,532 words.
The top 10 most frequently occurring English words include 'sitting', 'standing', 'white', 'people', 'women', 'holding', 'person', 'top', 'front', and 'table'.
The top 10 most frequently occurring Chinese words include '坐在', '站', '男人', '旁边', '桌子', '女人', '白色', '男子', '穿着', and '放在'.
Quotes
"Our objective is to design a transformer-based cross-modal and cross-lingual alignment module for efficient cross-lingual image captioning."
"We demonstrate that such heterogeneous attention is instrumental in generating precise and fluent English-Chinese captions simultaneously from visual information."
"Notably, our method represents the first application of heterogeneous attention embedded into a transformer decoder for cross-lingual image captioning in a single ensemble structure, effectively capturing both global and local features."