To improve models that must perform visual tasks after training on text alone, this paper proposes ArcSin, an adaptive cosine-similarity-based noise injection technique, and shows that it outperforms existing methods across a range of visual tasks.
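A minimal sketch of the kind of similarity-bounded noise injection the summary describes; the function name, threshold, and retry loop are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def inject_bounded_noise(text_emb: torch.Tensor, min_cos: float = 0.8,
                         init_scale: float = 1.0, max_tries: int = 10) -> torch.Tensor:
    """Add Gaussian noise to a text embedding, shrinking the noise until the
    cosine similarity with the original embedding stays above min_cos."""
    noise = init_scale * torch.randn_like(text_emb)
    for _ in range(max_tries):
        noisy = text_emb + noise
        if F.cosine_similarity(noisy, text_emb, dim=-1).min() >= min_cos:
            return noisy
        noise = noise * 0.5  # drifted too far from the text embedding: halve and retry
    return text_emb + noise
```

The intent of training on such noised text embeddings is that the model becomes robust to the gap between text and image embeddings it will see at inference.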
This paper introduces ReVisionLLM, a novel recursive vision-language model designed to overcome the limitations of existing VLMs in processing hour-long videos for temporal grounding tasks.
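A hedged sketch of what a recursive, coarse-to-fine temporal grounding loop could look like; `propose` stands in for a VLM call and the window/length parameters are hypothetical, not ReVisionLLM's actual interface:

```python
from typing import Callable, List, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec)

def recursive_ground(query: str, window: Window,
                     propose: Callable[[str, Window], List[Window]],
                     min_len: float = 60.0) -> List[Window]:
    """Coarse-to-fine temporal grounding: `propose` returns sub-windows likely
    to contain the queried event; recurse into each proposal until the
    windows are short enough to return directly."""
    start, end = window
    proposals = propose(query, window)
    if end - start <= min_len:
        return proposals
    hits: List[Window] = []
    for sub in proposals:
        s, e = sub
        if e - s < end - start:  # only recurse into strictly smaller windows
            hits.extend(recursive_ground(query, sub, propose, min_len))
        else:
            hits.append(sub)
    return hits
```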
This paper proposes a data-centric approach for pretrained Vision-Language Models (VLMs) that improves their ability to localize specific objects from a limited number of in-context example images.
Vision-Language Models (VLMs) can be trained to perform personalized object localization by fine-tuning them on carefully curated data from video object tracking datasets, enhancing their ability to localize specific object instances based on in-context examples.
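A small sketch of how a video object track might be turned into an in-context localization training sample, as the two summaries above describe; the field names and the split into context/query frames are assumptions, not the papers' actual data format:

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def make_personalization_sample(frames: List[str], boxes: List[Box],
                                n_context: int = 3) -> Dict:
    """Convert one object track (frame paths + boxes of the same instance)
    into an in-context localization sample: the first frames/boxes act as
    personalized examples, a later frame is the query to localize."""
    assert len(frames) > n_context and len(frames) == len(boxes)
    return {
        "context": list(zip(frames[:n_context], boxes[:n_context])),
        "query_frame": frames[n_context],
        "target_box": boxes[n_context],
    }
```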
Vision-language models (VLMs) excel at object recognition but struggle to understand sequential tasks, highlighting a critical limitation in their ability to function as reliable task supervisors for complex, multi-step activities.
Vision-Language Models (VLMs) exhibit distinct value preferences, often aligning with mainstream values like Hedonism, and these preferences can be systematically adjusted to induce specific personas through targeted role-playing strategies.
SOLO, a vision-language model built on a single Transformer architecture, offers advantages in scalability and efficiency compared to heterogeneous architectures while achieving comparable performance.
BLIP3-KALE leverages a two-stage approach, combining synthetic captions from large vision-language models with factual information from web-scale alt-text, to create a large-scale dataset of 218 million knowledge-augmented image-text pairs for training more capable and knowledgeable multimodal models.
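A minimal sketch of the caption-fusion step such a pipeline implies, i.e. asking a model to merge a dense synthetic caption with the factual content of the web alt-text; the prompt wording and function name are hypothetical:

```python
def build_kale_prompt(synthetic_caption: str, alt_text: str) -> str:
    """Compose a prompt asking a model to merge a dense synthetic caption
    with the factual content of the original web alt-text."""
    return (
        f"Synthetic caption: {synthetic_caption}\n"
        f"Web alt-text: {alt_text}\n"
        "Write a single caption that keeps the visual details from the "
        "synthetic caption and adds any factual knowledge (names, places, "
        "events) found in the alt-text."
    )
```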
Accurately evaluating the compositional reasoning abilities of modern VLMs requires new benchmarks and evaluation metrics that overcome the limitations of existing benchmarks and take both image and text context into account.
Modern Vision-Language Models (VLMs) still struggle with Compositional Reasoning (CR), and existing benchmarks fail to adequately challenge their capabilities. ConMe, a new benchmark with a novel VLM-based data generation pipeline, addresses this issue by creating harder and more realistic CR questions, revealing significant performance drops in state-of-the-art VLMs.