Core Concepts
By identifying and leveraging "visual anchors" – key points of visual information aggregation within image data – the Anchor Former (AcFormer) offers a more efficient and accurate approach to connecting visual data with large language models.
Liu, H., You, Q., Han, X., Liu, Y., Huang, H., He, R., & Yang, H. (2024). Visual Anchors Are Strong Information Aggregators for Multimodal Large Language Model. Advances in Neural Information Processing Systems, 38.
This research paper introduces a novel vision-language connector called Anchor Former (AcFormer) designed to enhance the accuracy and efficiency of Multimodal Large Language Models (MLLMs). The authors aim to address the limitations of existing vision-language connectors, particularly the high computational cost associated with processing large numbers of visual tokens and the potential for information loss when using fixed, learnable queries for information aggregation.
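The contrast between fixed learnable queries and visual anchors can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions, not the authors' implementation: this summary does not say how anchors are identified, so the per-token `saliency` scores, the helper names `select_anchors` and `aggregate`, and the omission of learned projections are all hypothetical. It shows the general idea of picking a few existing visual tokens as queries and letting them aggregate information from every visual token through cross-attention, so that fewer tokens are passed on to the language model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_anchors(saliency, k):
    """Return the indices of the k highest-scoring tokens, in original order.

    `saliency` is a hypothetical per-token importance score (an assumption
    made for this sketch; the summary does not define how anchors are found).
    """
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])

def aggregate(tokens, anchor_idx):
    """Cross-attention with the anchor tokens as queries over all tokens.

    Learned query/key/value projections are left out (identity) to keep the
    sketch short. Returns one aggregated vector per anchor.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for a in anchor_idx:
        query = tokens[a]
        scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                    for d in range(dim)])
    return out

# Six toy visual tokens of dimension 2, with made-up saliency scores.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5], [0.4, 0.6]]
saliency = [0.30, 0.05, 0.25, 0.05, 0.20, 0.15]

anchors = select_anchors(saliency, k=2)
compressed = aggregate(tokens, anchors)
print(anchors, len(compressed))  # prints: [0, 2] 2
```

Here six visual tokens are reduced to two, and each output vector is a weighted mix of all six tokens, tilted toward those most similar to its anchor. A connector with fixed learnable queries would instead use the same query vectors for every image, which is the potential source of information loss noted above.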