비전 모델과 대규모 언어 모델(LLM)을 단일 선형 레이어를 통해 효과적으로 연결하여 동결된 LLM이 시각적 세계를 이해할 수 있도록 한다.
Detect2Interact introduces an advanced approach for fine-grained object visual key field detection, enabling VQA systems to provide more contextually relevant and spatially accurate responses by leveraging segmentation, captioning, and large language model knowledge.
Leveraging the structural similarity between visual and textual triplets, comprising subject, predicate, and object, to accurately link textual descriptions to their corresponding image regions in a zero-shot setting.