Core Concepts
Leveraging text-to-image diffusion models, the proposed D-TIIL method can automatically localize semantic inconsistencies between text and images at both the word and pixel levels.
Abstract
The paper introduces a new method called Diffusion-based Text-Image Inconsistency Localization (D-TIIL) that uses text-to-image diffusion models to expose semantic inconsistencies between text and images.
The key insights are:
Text-to-image diffusion models can act as "omniscient" agents with extensive background knowledge to identify inconsistencies that may not be obvious to humans.
D-TIIL employs a multi-step process to align the semantic content of text and image, filtering out irrelevant information and incorporating background knowledge. This allows it to pinpoint the specific words and image regions that are inconsistent.
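The core idea of word- and region-level localization can be illustrated with a toy sketch. Note this is a hypothetical simplification, not the paper's method: D-TIIL uses a text-to-image diffusion model to align and denoise the two modalities, whereas here plain cosine similarity over made-up embeddings stands in for that machinery, and all names (`localize_inconsistency`, the threshold value) are illustrative assumptions.

```python
# Hypothetical sketch: flag words whose best-matching image region still
# scores below an alignment threshold. Embeddings here are toy vectors;
# D-TIIL itself derives alignment from a diffusion model, not raw cosines.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def localize_inconsistency(word_embs, region_embs, threshold=0.5):
    """Return indices of words whose best region-alignment score
    falls below `threshold` (i.e., candidate inconsistent words)."""
    flagged = []
    for i, w in enumerate(word_embs):
        best = max(cosine(w, r) for r in region_embs)
        if best < threshold:
            flagged.append(i)
    return flagged

# Toy example: word 2 points away from every region embedding,
# so it is the only word flagged as inconsistent.
words = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([-1.0, 0.0])]
regions = [np.array([1.0, 0.05]), np.array([0.8, 0.2])]
print(localize_inconsistency(words, regions))  # [2]
```

The same scoring can be run in the other direction (each region against its best word) to obtain the pixel-level counterpart of the word-level flags.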
The authors also introduce a new dataset called TIIL, which contains 14K text-image pairs with carefully curated inconsistencies at both the word and pixel levels. This enables comprehensive evaluation of text-image inconsistency localization methods.
Experiments show that D-TIIL outperforms previous classification-based approaches in both localization and detection of text-image inconsistencies. The method provides an interpretable and scalable framework for combating online misinformation involving mismatched text and images.
Example Captions
"A school bus on the New Jersey Turnpike collided with a tractor-trailer Wednesday"
"The 1992 ad featuring the supermodel drinking an orange juice in front of two pubescent boys proved that sex appeal sells products."
"Britain's Queen Diana leaves the annual Braemar Highland Gathering in Braemar Scotland Sept 6 2014"
Quotes
"To the best of our knowledge, it is the first of its kind to feature both pixel-level and word-level inconsistencies, offering fine-grained and reliable inconsistency."
"Text-to-image diffusion models trained on large-scale datasets, such as DALL-E2 (Ramesh et al., 2022), Stable Diffusion (Rombach et al., 2022), Glide (Nichol et al., 2021), and GLIGEN (Li et al., 2023), can generate realistic images with consistent semantic content in the text prompts."