MAGID presents an automated pipeline for creating synthetic multi-modal datasets that combine text and images. The framework addresses challenges of privacy, diversity, and quality in generating conversational data. By pairing a diffusion model with a quality assurance (QA) module, MAGID keeps text and images aligned, yielding high-quality multi-modal dialogues. The system uses several prompt engineering strategies to select the utterances best suited for image augmentation. The QA module then filters generated images using image-text matching, image quality, and content safety scores, so that only relevant and safe images are kept. Human evaluations show that MAGID outperforms retrieval-based synthetic datasets such as MMDD and compares favorably with real datasets such as MMDialog and PhotoChat on realism, engagement, image quality, and context matching.
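The QA step described above can be sketched as a simple threshold filter over per-image scores. This is an illustrative assumption, not the authors' implementation: the scorer names (`clip_score`, `aesthetic_score`, `safety_score`), the `Candidate` dataclass, and the threshold values are all hypothetical stand-ins for whatever image-text matching, quality, and safety models the pipeline actually uses.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One generated image proposed for a dialogue utterance (illustrative)."""
    utterance: str
    image_id: str
    clip_score: float       # image-text matching score (assumed metric)
    aesthetic_score: float  # image quality score (assumed metric)
    safety_score: float     # content safety score (assumed metric)

def passes_qa(c: Candidate,
              min_match: float = 0.25,
              min_quality: float = 5.0,
              min_safety: float = 0.9) -> bool:
    """Keep an image only if all three QA checks clear their thresholds."""
    return (c.clip_score >= min_match
            and c.aesthetic_score >= min_quality
            and c.safety_score >= min_safety)

def filter_candidates(cands: list[Candidate]) -> list[Candidate]:
    """Drop candidates that fail any QA check."""
    return [c for c in cands if passes_qa(c)]

candidates = [
    Candidate("Here is my new puppy!", "img_01", 0.31, 6.2, 0.99),
    Candidate("Here is my new puppy!", "img_02", 0.18, 7.0, 0.99),  # poor match
]
kept = filter_candidates(candidates)
```

In a real pipeline, rejected candidates would typically trigger regeneration with an adjusted prompt rather than silent removal, so that the selected utterance still receives an image.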
Key insights distilled from the paper by Hossein Abou... at arxiv.org, 03-06-2024: https://arxiv.org/pdf/2403.03194.pdf