Core Concepts
MoMA is an open-vocabulary, tuning-free personalized image generation model. It leverages a multimodal large language model (MLLM) to blend text prompts with the visual features of a reference image, enabling flexible zero-shot recontextualization and texture editing.
Abstract
The paper introduces MoMA, a novel image personalization model that excels in detail fidelity, object identity resemblance, and coherent textual prompt integration. MoMA harnesses the capabilities of Multimodal Large Language Models (MLLMs) to seamlessly blend text prompts with visual features of a reference image, enabling alterations in both the background context and object texture.
Key highlights:
MoMA utilizes a generative multimodal image-feature decoder to extract and edit image features conditioned on the target text prompt, fusing the reference image with the textual information (see the first sketch after this list).
A self-attention feature transfer mechanism with a masking procedure is introduced to enhance the detail quality of generated images (see the second sketch after this list).
MoMA achieves superior performance in both recontextualization and texture editing tasks without any per-instance tuning, demonstrating its flexibility and efficiency.
Extensive experiments and comparisons show MoMA outperforms existing tuning-free open-vocabulary personalization approaches in detail fidelity, identity preservation, and prompt faithfulness.
MoMA can be directly applied to various community-tuned diffusion models, showcasing its versatility as a plug-and-play module.
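To make the first highlight concrete, below is a minimal PyTorch sketch of the idea behind a generative multimodal image-feature decoder: reference-image tokens and target-prompt tokens are fused so the prompt can reshape the image features, and a small set of learned queries reads out a contextualized image embedding for downstream injection. The module names, dimensions, and the simple transformer stand-in for the MLLM are illustrative assumptions, not the paper's actual implementation (which builds on a pre-trained MLLM).

```python
# Illustrative sketch only: a toy fusion module standing in for the MLLM decoder.
import torch
import torch.nn as nn


class MultimodalFeatureDecoder(nn.Module):
    """Fuses reference-image tokens with target-prompt tokens and returns
    a contextualized image embedding for downstream injection."""

    def __init__(self, dim: int = 768, n_heads: int = 8, n_layers: int = 2,
                 n_out_tokens: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learned queries that read out the fused (image + text) context.
        self.out_queries = nn.Parameter(torch.randn(1, n_out_tokens, dim))
        self.readout = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, image_tokens: torch.Tensor,
                prompt_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate visual and textual tokens so self-attention lets the
        # prompt "edit" the image features, as described in the highlight.
        fused = self.fusion(torch.cat([image_tokens, prompt_tokens], dim=1))
        queries = self.out_queries.expand(image_tokens.size(0), -1, -1)
        contextualized, _ = self.readout(queries, fused, fused)
        return contextualized  # (B, n_out_tokens, dim)


# Toy usage: CLIP-like image tokens and 77 prompt tokens.
decoder = MultimodalFeatureDecoder()
img_tok = torch.randn(1, 257, 768)
txt_tok = torch.randn(1, 77, 768)
print(decoder(img_tok, txt_tok).shape)  # torch.Size([1, 4, 768])
```

In the actual model, the resulting contextualized embedding is injected into the diffusion UNet (e.g., through additional cross-attention), so no per-instance tuning is needed at generation time.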
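The second highlight can be sketched in a similar hedged way: keys and values from a reference pass are appended to the generation pass's self-attention, and a subject mask confines the injected detail to the object region. Function and tensor names below are illustrative assumptions; the real mechanism operates inside the diffusion UNet's self-attention layers.

```python
# Illustrative sketch only: masked self-attention feature transfer.
import torch
import torch.nn.functional as F


def masked_self_attention_transfer(q, k, v, k_ref, v_ref, subject_mask,
                                   num_heads: int = 8):
    """q, k, v: (B, N, C) features of the generation pass.
    k_ref, v_ref: (B, N, C) features of the reference pass.
    subject_mask: (B, N) bool, True where a reference token covers the object."""
    B, N, C = q.shape
    d = C // num_heads

    def split(x):  # (B, N, C) -> (B, heads, N, d)
        return x.view(B, -1, num_heads, d).transpose(1, 2)

    # Generation tokens may attend to their own keys plus the reference keys.
    k_all = torch.cat([k, k_ref], dim=1)
    v_all = torch.cat([v, v_ref], dim=1)

    # Attend to all self keys, but only to masked (object-region) reference keys.
    attn_mask = torch.cat(
        [torch.ones(B, N, dtype=torch.bool, device=q.device), subject_mask],
        dim=1)                               # (B, 2N)
    attn_mask = attn_mask[:, None, None, :]  # broadcast over heads and queries

    out = F.scaled_dot_product_attention(
        split(q), split(k_all), split(v_all), attn_mask=attn_mask)
    return out.transpose(1, 2).reshape(B, N, C)


# Toy usage with 64 spatial tokens.
B, N, C = 1, 64, 320
toks = [torch.randn(B, N, C) for _ in range(5)]
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, :16] = True  # pretend the first 16 tokens cover the object
print(masked_self_attention_transfer(*toks, mask).shape)  # (1, 64, 320)
```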