The paper proposes an end-to-end framework to address the GCE task. The key aspects are:
- The input scene description is represented as a scene graph, where each node represents an object and each edge encodes an inter-object relationship (see the data-structure sketch after this list).
- A Scene Graph Enricher, built from Graph Convolutional Networks (GCNs), iteratively appends new objects and their relationships to the input scene graph. This enrichment is guided by a pair of Scene Graph Discriminators that keep the enriched graph structurally realistic and semantically coherent (a sketch of this loop follows the list).
- The enriched scene graph is then fed into an Image Synthesizer to generate the final enriched image.
- A Visual Scene Characterizer and an Image-Text Aligner ensure that the generated image preserves the essential visual and textual characteristics of the original scene description (an illustrative alignment objective appears after the list).
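To make the graph representation concrete, here is a minimal Python sketch of a scene graph as object nodes plus (subject, predicate, object) triplet edges; the `SceneGraph` class and its method names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Minimal scene-graph container: objects as nodes,
    (subject, predicate, object) triplets as directed edges."""
    objects: list[str] = field(default_factory=list)  # node labels, e.g. "dog"
    triplets: list[tuple[int, str, int]] = field(default_factory=list)  # (subj_idx, predicate, obj_idx)

    def add_object(self, label: str) -> int:
        """Append a node and return its index."""
        self.objects.append(label)
        return len(self.objects) - 1

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        """Add a directed edge between two existing nodes."""
        assert subj < len(self.objects) and obj < len(self.objects)
        self.triplets.append((subj, predicate, obj))

# Example: "a man riding a horse on a beach"
g = SceneGraph()
man, horse, beach = g.add_object("man"), g.add_object("horse"), g.add_object("beach")
g.add_relation(man, "riding", horse)
g.add_relation(horse, "on", beach)
```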
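The enrichment step can be pictured as GCN message passing followed by gated node insertion. The sketch below assumes embedding-based triplet message passing; `propose`, `d_struct`, and `d_sem` are hypothetical stand-ins for the paper's proposal module and the two Scene Graph Discriminators, and none of these names or signatures come from the paper.

```python
import torch
import torch.nn as nn

class TripletGCNLayer(nn.Module):
    """One message-passing step over (subject, predicate, object) triplets:
    each edge updates its endpoint features through a shared MLP, a common
    scheme for scene-graph GCNs (a stand-in for the paper's layer)."""
    def __init__(self, dim: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())

    def forward(self, node_feats, pred_feats, edges):
        # edges: LongTensor of shape (E, 2) holding (subj_idx, obj_idx) pairs
        out = node_feats.clone()
        for e, (s, o) in enumerate(edges.tolist()):
            msg = self.edge_mlp(torch.cat([node_feats[s], pred_feats[e], node_feats[o]]))
            out[s] = out[s] + msg
            out[o] = out[o] + msg
        return out

def enrich(node_feats, pred_feats, edges, gcn, propose, d_struct, d_sem,
           steps=3, tau=0.5):
    """Iteratively append a candidate object and relation, keeping the
    candidate only if both discriminators score it above a threshold.
    All callables here are assumed interfaces, not the paper's modules."""
    for _ in range(steps):
        h = gcn(node_feats, pred_feats, edges)
        new_node, new_pred, new_edge = propose(h)  # candidate object + relation
        cand_nodes = torch.cat([h, new_node[None]], dim=0)
        cand_preds = torch.cat([pred_feats, new_pred[None]], dim=0)
        cand_edges = torch.cat([edges, new_edge[None]], dim=0)
        # accept the candidate only if both discriminators agree
        if (d_struct(cand_nodes, cand_preds, cand_edges) > tau
                and d_sem(cand_nodes, cand_preds, cand_edges) > tau):
            node_feats, pred_feats, edges = cand_nodes, cand_preds, cand_edges
    return node_feats, pred_feats, edges
```

In the full pipeline, the enriched node and predicate features would then be handed to the Image Synthesizer to render the final image.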
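One common way to realize an image-text aligner is a symmetric contrastive objective over normalized embeddings. The sketch below is an assumption rather than the paper's exact loss; `img_emb` and `txt_emb` would come from an image encoder such as the Visual Scene Characterizer and a text encoder, respectively.

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss that pulls matching image/text pairs
    together (illustrative, not necessarily the paper's objective).
    Inputs: (B, D) batches of image and text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (B, B) cosine-similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # matched pairs sit on the diagonal; penalize mismatches in both directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```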
Experiments on the Visual Genome dataset show that the framework generates visually plausible images with richer semantic content than state-of-the-art text-to-image generation methods. An ablation study confirms the contribution of each component of the framework.