This paper proposes a self-supervised visual preference alignment (SeVa) pipeline that efficiently generates preference data without relying on supervision from GPT-4 or human annotations. The key idea is to leverage commonly used image augmentations to induce vision-language models into producing false yet plausible (hard negative) responses, which are then paired with the original responses to train the models through direct preference optimization (DPO).
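A minimal sketch of this data-construction step is shown below. It assumes a generic `vlm_generate` call standing in for the underlying model (e.g., LLaVA-1.5) and uses illustrative torchvision augmentations; the paper's exact augmentation choices and generation settings are not reproduced here.

```python
from dataclasses import dataclass
from PIL import Image
from torchvision import transforms

def vlm_generate(image: Image.Image, prompt: str) -> str:
    # Hypothetical stand-in for a vision-language model generation call;
    # any function mapping (image, prompt) -> text response will do.
    raise NotImplementedError("plug in your VLM here")

# Illustrative image distortions (assumed, not the paper's exact choices).
augment = transforms.Compose([
    transforms.RandomResizedCrop(336, scale=(0.2, 0.5)),
    transforms.ColorJitter(brightness=0.5, contrast=0.5),
])

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response conditioned on the clean image
    rejected: str  # response conditioned on the distorted image

def build_pair(image: Image.Image, prompt: str) -> PreferencePair:
    chosen = vlm_generate(image, prompt)             # original view -> preferred
    rejected = vlm_generate(augment(image), prompt)  # distorted view -> dispreferred
    return PreferencePair(prompt, chosen, rejected)
```

The pairs produced this way require no external annotator: the same model that will later be aligned supplies both the preferred and the dispreferred response.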
The authors first observe that vision-language models are sensitive to data augmentations, so responses conditioned on the original and on distorted images can be paired to construct valuable preference data. They then show mathematically that SeVa is closely related to visual contrastive learning, with responses to augmented views of the same image acting as negatives.
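For reference, the preference pairs above plug into the standard DPO objective (the general formulation, not a SeVa-specific derivation), with the clean-image response as the preferred output $y_w$ and the distorted-image response as the dispreferred output $y_l$:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_\theta$ is the model being tuned, $\pi_{\mathrm{ref}}$ is a frozen reference copy, and $\beta$ controls how far the tuned model may drift from the reference. Pushing the log-ratio of the clean-image response above that of the augmented-image response is what ties the objective to the contrastive-learning view described above.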
Experiments on various multi-modal benchmarks demonstrate the effectiveness of SeVa: it consistently improves LLaVA-1.5-7B/13B by a large margin, e.g., 6.7%/5.6% score gains on the challenging MM-Vet benchmark. Detailed ablations and visualizations show that SeVa yields more detailed descriptions, fewer hallucinations, stronger OCR, and better chain-of-thought reasoning.
The authors highlight that SeVa is a simple and efficient pipeline that requires no GPT-4 supervision or costly human-annotated data, paving the way for future preference alignment in the vision-language domain.