Central Concepts
LexVLA is a novel framework for vision-language alignment that leverages pre-trained uni-modal models and a unified lexical representation to achieve superior performance and interpretability in cross-modal retrieval, surpassing methods that rely on large multi-modal datasets and complex training schemes.
Summary
Bibliographic Information:
Li, Y., Wang, Y., Fu, Y., Ru, D., Zhang, Z., & He, T. (2024). Unified Lexical Representation for Interpretable Visual-Language Alignment. Advances in Neural Information Processing Systems, 38.
Research Objective:
This paper introduces LexVLA, a novel framework designed to enhance vision-language alignment (VLA) by learning a unified lexical representation for both visual and textual modalities. The authors aim to address the limitations of existing VLA models, particularly the interpretability issues associated with latent feature alignment and the complexities of training lexical representations.
Methodology:
LexVLA leverages pre-trained uni-modal models, specifically DINOv2 for its local-inclined visual features and Llama 2 for its in-context lexical prediction capabilities. The framework employs distinct codebooks for each modality to maintain the strengths of the pre-trained models while learning a shared lexical vocabulary. An "overuse penalty" is introduced to encourage sparsity and prevent the over-activation of irrelevant tokens during training. The model undergoes incremental fine-tuning using a contrastive learning objective.
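A minimal sketch of how such a pipeline can be wired together is shown below. It is not the authors' code: the frozen DINOv2 and Llama 2 encoders are replaced by random feature tensors, the per-modality codebooks are plain linear projections onto a toy vocabulary, and the overuse penalty is written in an illustrative FLOPs-style form (the paper's exact penalty may differ). All names and hyperparameters here (vocab_size, the 0.07 temperature, the 0.1 penalty weight) are assumptions for illustration.
```python
# Minimal sketch of the alignment objective described above, not the authors'
# code. Frozen uni-modal encoders are stood in for by random feature tensors,
# each modality gets its own codebook projecting onto a shared (toy) lexical
# vocabulary, and training combines a CLIP-style contrastive loss with an
# illustrative FLOPs-style "overuse penalty" on batch-wide token activations.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, vocab_size = 8, 1000            # toy sizes; the real vocabulary is far larger
img_dim, txt_dim = 768, 4096           # stand-ins for DINOv2 / Llama 2 feature widths

# Pooled features that frozen DINOv2 / Llama 2 encoders would normally provide.
img_feats = torch.randn(batch, img_dim)
txt_feats = torch.randn(batch, txt_dim)

# Distinct per-modality codebooks mapping into the shared lexical vocabulary.
img_codebook = torch.nn.Linear(img_dim, vocab_size)
txt_codebook = torch.nn.Linear(txt_dim, vocab_size)

def lexical_embed(codebook, feats):
    # One non-negative score per vocabulary token, i.e. a lexical representation.
    return F.relu(codebook(feats))

def overuse_penalty(lex):
    # Penalize tokens with high average activation across the batch, which
    # discourages tokens that fire regardless of the input (FLOPs-style form;
    # the paper's exact penalty may differ).
    return lex.mean(dim=0).pow(2).sum()

params = list(img_codebook.parameters()) + list(txt_codebook.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)
for step in range(3):                  # a few illustrative training steps
    v = F.normalize(lexical_embed(img_codebook, img_feats), dim=-1)
    t = F.normalize(lexical_embed(txt_codebook, txt_feats), dim=-1)
    logits = v @ t.T / 0.07            # CLIP-style temperature-scaled similarities
    labels = torch.arange(batch)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    loss = contrastive + 0.1 * (overuse_penalty(v) + overuse_penalty(t))
    opt.zero_grad()
    loss.backward()
    opt.step()
```
Keeping the codebooks modality-specific, as the summary notes, is the stated mechanism for preserving each pre-trained encoder's strengths while still producing lexical vectors that live in the same shared vocabulary space.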
Key Findings:
- LexVLA demonstrates superior performance in zero-shot cross-modal retrieval tasks on Flickr30k and MSCOCO datasets, outperforming baselines trained on significantly larger multi-modal datasets.
- The use of pre-trained uni-modal models and distinct codebooks proves effective in achieving strong alignment with less multi-modal training data.
- The proposed "overuse penalty" successfully encourages sparsity in the lexical representation while mitigating the activation of semantically irrelevant tokens, leading to improved interpretability.
- A novel metric, PatchDis, is introduced to evaluate patch-level interpretability, demonstrating LexVLA's ability to learn fine-grained visual representations.
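Although the summary does not spell out how PatchDis is computed, a plausible, hedged reading is sketched below: each visual patch is assigned to the class whose text-derived lexical embedding it matches best, and the assignment is scored against a ground-truth patch labelling with mean IoU. The function name patchdis_miou and all tensor shapes are illustrative assumptions, not the authors' exact definition.
```python
# Hedged sketch of a PatchDis-style evaluation. The summary only says PatchDis
# measures patch-level interpretability; the recipe below (classify each visual
# patch by its best-matching class text embedding, then score the assignment
# against ground-truth patch labels with mean IoU) is an assumption.
import torch
import torch.nn.functional as F

def patchdis_miou(patch_lex: torch.Tensor,   # (num_patches, vocab) patch-level lexical features
                  class_lex: torch.Tensor,   # (num_classes, vocab) lexical features of class names
                  gt_labels: torch.Tensor) -> float:  # (num_patches,) ground-truth class per patch
    sims = F.normalize(patch_lex, dim=-1) @ F.normalize(class_lex, dim=-1).T
    pred = sims.argmax(dim=-1)               # assign every patch to its nearest class
    ious = []
    for c in range(class_lex.shape[0]):
        inter = ((pred == c) & (gt_labels == c)).sum().item()
        union = ((pred == c) | (gt_labels == c)).sum().item()
        if union > 0:                        # ignore classes absent from both prediction and GT
            ious.append(inter / union)
    return sum(ious) / max(len(ious), 1)

# Toy usage with random tensors standing in for real patch/class features.
score = patchdis_miou(torch.rand(196, 1000), torch.rand(5, 1000), torch.randint(0, 5, (196,)))
print(f"PatchDis-style mIoU on random inputs: {score:.3f}")
```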
Main Conclusions:
LexVLA presents a compelling approach to VLA by combining the strengths of pre-trained uni-modal models with a unified lexical representation. The framework's efficiency in learning from limited multi-modal data, coupled with its enhanced interpretability, positions it as a valuable contribution to the field.
Significance:
This research significantly advances the field of VLA by offering a more interpretable and data-efficient approach to aligning visual and textual information. The proposed framework and the PatchDis metric have the potential to influence future research in cross-modal retrieval and understanding.
Limitations and Future Research:
The study acknowledges the limitations of the current vocabulary derived from the language model's tokenizer, which may not perfectly represent word-level semantics. Future research could explore the development of a dedicated word-level vocabulary that leverages the strengths of LLMs while addressing the limitations of sub-word tokenization.
Statistics
LexVLA is trained on the CC-12M dataset, utilizing 9.2 million out of the full 12.4 million image-text pairs.
The model achieves a sparsity ratio of 98.27%, activating only 296 lexical tokens on average, compared with CLIP's 512-dimensional latent vector, while still delivering superior retrieval performance.
LexVLA has 109 million trainable parameters in total: 70 million for the vision codebook, 17 million for the vision projector (19.76% of DINOv2's parameters), and 21 million for the Llama adaptor (0.30% of Llama 2's parameters).
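As a quick sanity check on these figures (illustrative arithmetic only; the vocabulary size below is inferred from the stated sparsity, not reported in this summary):
```python
# Illustrative arithmetic only: the vocabulary size is inferred from the
# reported sparsity; it is not stated in this summary.
active_tokens, sparsity = 296, 0.9827
implied_vocab = active_tokens / (1 - sparsity)   # ≈ 17,100 lexical tokens
trainable_m = 70 + 17 + 21                       # millions: codebook + projector + adaptor
print(f"implied vocabulary ≈ {implied_vocab:,.0f} tokens")
print(f"sum of listed components ≈ {trainable_m}M (matches the reported 109M up to rounding)")
```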
Quotes
"On the other hand, the lexical representation is known for its clarity as each dimension corresponds to the similarity between the input and a specific word/token from the vocabulary."
"However, learning lexical representation is difficult. The embedding vector is much larger than the CLIP latent vector as the vocabulary size is usually much larger than CLIP’s feature dimension."
"In this paper, we propose LexVLA, a simple yet comprehensive framework for learning a unified lexical representation in the CLIP-style contrastive training pipeline, facilitating the VLA."