LexVLA is a vision-language alignment framework that builds a unified lexical representation on top of pre-trained uni-modal models, achieving strong and interpretable cross-modal retrieval without the large multi-modal datasets and complex training schemes that earlier methods rely on.
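The central idea of a lexical representation, mapping both images and text into sparse vectors over a shared vocabulary and comparing them by dot product, can be illustrated with a minimal sketch; the projection heads and dimensions below are hypothetical placeholders, not LexVLA's actual architecture.

```python
import torch
import torch.nn.functional as F

# Hypothetical placeholders: in practice the features would come from frozen
# pre-trained uni-modal encoders (e.g. a vision transformer and an LLM), with
# small projection heads mapping them into vocabulary space.
VOCAB_SIZE = 32000

def encode_to_lexical(features: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """Project features into a sparse, non-negative vocabulary-space vector."""
    logits = proj(features)              # (batch, VOCAB_SIZE)
    lexical = F.relu(logits)             # non-negativity encourages sparsity
    return F.normalize(lexical, dim=-1)  # unit-normalize for dot-product retrieval

# Toy projections standing in for trained projection heads.
img_proj = torch.nn.Linear(768, VOCAB_SIZE)
txt_proj = torch.nn.Linear(768, VOCAB_SIZE)

img_lex = encode_to_lexical(torch.randn(4, 768), img_proj)  # 4 images
txt_lex = encode_to_lexical(torch.randn(4, 768), txt_proj)  # 4 captions

# Cross-modal retrieval scores are dot products in the shared lexical space.
# Each dimension corresponds to a vocabulary token, so large entries are
# directly interpretable as the words an image or caption "activates".
scores = img_lex @ txt_lex.T                                 # (4, 4)
print(scores.argmax(dim=1))  # best-matching caption index per image
```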
FiSAO (Fine-Grained Self-Alignment Optimization) is proposed as a novel self-alignment method that uses fine-grained verifiers to address the challenge of aligning large language models (LLMs) with pre-trained vision models.
Vision-Language Large Models (VLLMs) can be significantly improved by using fine-grained, token-level feedback from the model's own visual encoder to optimize alignment between visual and linguistic modalities, eliminating the need for external data or reward models.
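A minimal sketch of what such token-level feedback could look like: each generated token is scored by its similarity to features from the model's own visual encoder, and the per-token scores act as a fine-grained alignment signal. The similarity-based reward shaping below is an illustrative assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def token_level_rewards(token_embeddings: torch.Tensor,
                        visual_features: torch.Tensor) -> torch.Tensor:
    """
    Illustrative fine-grained feedback: score each generated token by its
    maximum cosine similarity to any visual patch feature, so tokens grounded
    in the image receive higher rewards than ungrounded ones.

    token_embeddings: (num_tokens, dim)  embeddings of the generated tokens
    visual_features:  (num_patches, dim) features from the model's own visual encoder
    """
    tok = F.normalize(token_embeddings, dim=-1)
    vis = F.normalize(visual_features, dim=-1)
    sim = tok @ vis.T              # (num_tokens, num_patches)
    return sim.max(dim=1).values   # one reward per generated token

# Toy example with random tensors standing in for real encoder outputs.
rewards = token_level_rewards(torch.randn(12, 512), torch.randn(49, 512))
# These per-token rewards could then weight an alignment objective,
# with no external reward model or extra annotated data involved.
print(rewards.shape)  # torch.Size([12])
```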