PEELING introduces an image-aware property-reduction approach to text perturbation for adversarial testing of visual grounding (VG) models, substantially improving issue detection and, in turn, model performance.
Integrating subject-level guidance enhances CLIP for improved zero-shot transfer on human-centric tasks.
The proposed PathM3 is an effective method for histopathology image classification and captioning, leveraging limited text data to improve model performance.
Systematically observing, and then improving, fine-grained cooperation between modalities enhances multimodal learning.
This paper introduces GenLLaVA, a large multimodal model trained with a novel "generative visual instruction tuning" approach. By unifying visual understanding, generation, and editing within a single architecture, GenLLaVA outperforms previous models across all three task families.
ARIA, an open-source multimodal native Mixture-of-Experts (MoE) model, achieves state-of-the-art performance in various multimodal, language, and coding tasks, demonstrating its capability to effectively integrate and understand information from different modalities, especially in long-context scenarios.
ROSS, a novel approach to visual instruction tuning, enhances the visual comprehension capabilities of Large Multimodal Models (LMMs) by incorporating a vision-centric reconstructive objective that compels the model to reconstruct input images, thereby improving fine-grained understanding and reducing hallucinations.
Multimodal learning models often overfit to a dominant modality, hindering performance; this paper introduces a Multi-Loss Balanced method to mitigate this issue by dynamically adjusting learning rates based on individual modality performance, leading to improved accuracy across various datasets and fusion techniques.
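The balancing idea above can be sketched in a few lines: scale each modality's learning rate inversely to its relative performance, so the dominant modality learns more slowly and the weaker ones catch up. This is an illustrative sketch, not the paper's exact rule; the `balanced_lrs` helper and its accuracy-ratio heuristic are assumptions.

```python
def balanced_lrs(base_lr, modality_accs):
    """Illustrative multi-loss balancing heuristic (an assumption,
    not the paper's exact formula): scale each modality's learning
    rate by (mean accuracy / its accuracy), so better-performing
    modalities are slowed and weaker ones are sped up."""
    mean_acc = sum(modality_accs.values()) / len(modality_accs)
    return {m: base_lr * (mean_acc / acc) for m, acc in modality_accs.items()}

# Example: audio dominates, so its learning rate is reduced
# while the lagging video branch gets a larger one.
lrs = balanced_lrs(1e-3, {"audio": 0.9, "video": 0.6})
```

In a framework such as PyTorch, these per-modality rates would typically be applied via separate optimizer parameter groups, one per modality encoder.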
Multimodal models often suffer from imbalanced learning, where modalities with stronger discriminative abilities dominate the training process, hindering the optimization of other modalities. This paper introduces On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) to mitigate this issue by dynamically controlling the optimization of each modality based on their discriminative discrepancies during training.
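The gradient-modulation idea can be illustrated as follows: compute, per modality, the ratio of its discriminative score to the others', and attenuate the gradients of any modality that is ahead. The `modulation_coeffs` function and the `tanh` damping curve are assumptions for illustration, not the authors' exact formulation.

```python
import math

def modulation_coeffs(scores, alpha=0.5):
    """Sketch of on-the-fly gradient modulation (an assumption, not
    the paper's exact rule): for each modality, compare its
    discriminative score to the mean of the others; modalities that
    dominate get a gradient-scaling coefficient below 1, while the
    rest train at full strength."""
    coeffs = {}
    for m, s in scores.items():
        others = [v for k, v in scores.items() if k != m]
        ratio = s / (sum(others) / len(others))
        # Damp the dominant modality smoothly; leave others untouched.
        coeffs[m] = 1.0 - math.tanh(alpha * (ratio - 1.0)) if ratio > 1.0 else 1.0
    return coeffs

# Example: the audio branch is far ahead, so its gradients are
# scaled down while video keeps its full gradient.
coeffs = modulation_coeffs({"audio": 0.8, "video": 0.4})
```

During training, each coefficient would multiply the corresponding modality encoder's gradients before the optimizer step, recomputed on the fly each iteration from the current batch's per-modality predictions.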