MoMA: An Open-Vocabulary, Tuning-Free Multimodal LLM Adapter for Personalized Image Generation
MoMA is an open-vocabulary, tuning-free personalized image generation model. It leverages a multimodal large language model (MLLM) to blend a text prompt with the visual features of a reference image, enabling zero-shot recontextualization and texture editing without per-subject fine-tuning.
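
To make the blending idea concrete, below is a minimal sketch (not MoMA's actual code or API) of one common way an adapter can inject MLLM-derived reference-image features into a diffusion UNet alongside text features: a decoupled cross-attention layer with a shared query and separate key/value projections per modality. All class, method, and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlendedCrossAttention(nn.Module):
    """Cross-attention that attends to text tokens and, separately, to
    image tokens produced by an MLLM adapter, then sums the two results.
    Illustrative sketch only; not the official MoMA implementation."""

    def __init__(self, dim: int, ctx_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Shared query; separate key/value projections per modality.
        self.to_k_text = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_text = nn.Linear(ctx_dim, dim, bias=False)
        self.to_k_img = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_img = nn.Linear(ctx_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q, k, v):
        b, n, d = q.shape
        h = self.num_heads
        # Reshape to (batch, heads, tokens, head_dim) for attention.
        q, k, v = (t.view(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, x, text_tokens, image_tokens, image_scale: float = 1.0):
        q = self.to_q(x)
        # Attend to the text-prompt tokens.
        text_out = self._attend(q, self.to_k_text(text_tokens), self.to_v_text(text_tokens))
        # Attend to the MLLM-derived reference-image tokens.
        img_out = self._attend(q, self.to_k_img(image_tokens), self.to_v_img(image_tokens))
        # Blend: image_scale controls how strongly the reference subject is injected.
        return self.to_out(text_out + image_scale * img_out)


# Toy usage: 4096 latent tokens, 77 text tokens, 16 adapter image tokens.
attn = BlendedCrossAttention(dim=320, ctx_dim=768)
x = torch.randn(1, 4096, 320)
text = torch.randn(1, 77, 768)
img = torch.randn(1, 16, 768)
out = attn(x, text, img, image_scale=0.8)
print(out.shape)  # torch.Size([1, 4096, 320])
```

Keeping the image and text branches decoupled is what allows the adapter to stay tuning-free: the `image_scale` knob can trade off prompt-following (recontextualization) against subject fidelity (texture preservation) at inference time, without retraining.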