Core Concepts
The authors propose a method that fine-tunes a Vision-Language Model to classify artifacts in synthetic images and uses its output as feedback to alleviate them, improving image quality.
Abstract
In the rapidly evolving field of image synthesis, the presence of complex artifacts compromises the perceptual realism of synthetic images. To address this challenge, the authors fine-tune a Vision-Language Model (VLM) as an artifact classifier to automatically identify and classify a wide range of artifacts. By developing a comprehensive artifact taxonomy and constructing a dataset named SynArtifact-1K, the fine-tuned VLM demonstrates superior ability in identifying artifacts, outperforming the baseline by 25.66%. The output of the VLM is then leveraged as feedback to refine generative models and alleviate artifacts. Visualization results and user studies confirm the improved quality of images synthesized by the refined diffusion model.
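The classification step described above maps a VLM's free-text answer onto a fixed artifact taxonomy. A minimal sketch of that mapping is below; the taxonomy entries and function name are illustrative assumptions, not the paper's actual taxonomy or code.

```python
# Hypothetical sketch: matching a VLM's free-text answer against a small
# artifact taxonomy. The category names here are illustrative only; the
# paper defines its own, much more comprehensive taxonomy.
ARTIFACT_TAXONOMY = [
    "no artifacts",
    "distorted hands",
    "awkward facial expression",
    "blurred texture",
]

def classify_artifact(vlm_answer: str) -> str:
    """Return the first taxonomy category found in the answer (case-insensitive)."""
    answer = vlm_answer.lower()
    for category in ARTIFACT_TAXONOMY:
        if category in answer:
            return category
    return "unknown"

print(classify_artifact("The image shows an awkward facial expression."))
```

In practice the fine-tuned VLM is trained to emit answers in a constrained template (see the Quotes section), which makes this kind of string matching reliable.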
Stats
To alleviate artifacts and improve quality, fine-tuned VLM outperforms baseline by 25.66%.
SynArtifact-1K dataset contains 1.3k annotated images with artifact categories, captions, and coordinates.
LLaVA fine-tuned on SynArtifact-1K achieves an accuracy of 45.66% for artifact classification.
Weight initialization impacts performance: initializing from Stage 1 yields better results than initializing from Stage 2.
RLAIF strategy effectively improves synthetic image quality through artifact classification reward.
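The RLAIF strategy above turns the artifact classifier's judgment into a scalar reward for refining the generative model. A minimal sketch of one plausible reward design is shown here; the function name, probability format, and reward formula are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): converting an
# artifact classifier's output distribution into a scalar reward for
# RLAIF-style fine-tuning of a diffusion model.
def artifact_reward(class_probs: dict) -> float:
    """Reward = probability of 'no artifacts' minus total artifact probability,
    so artifact-free images score near +1 and heavily flawed ones near -1."""
    clean = class_probs.get("no artifacts", 0.0)
    artifact_mass = sum(p for c, p in class_probs.items() if c != "no artifacts")
    return clean - artifact_mass

probs = {"no artifacts": 0.7, "distorted hands": 0.2, "blurred texture": 0.1}
print(artifact_reward(probs))
```

A signed reward like this penalizes images the classifier flags as artifact-laden while rewarding clean ones, which is the feedback loop the stat describes.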
Quotes
"No artifacts" - Example answer template for reference.
"Awkward facial expression." - Example answer example for artifact classification.