Core Concepts
The proposed method, Personalized Multimodal Generation (PMG), leverages large language models to extract user preferences from historical behaviors and generates personalized multimodal content by conditioning a generator, such as a multimodal LLM or diffusion model, on the extracted preferences.
Abstract
The paper proposes a method called Personalized Multimodal Generation (PMG) that leverages large language models (LLMs) to enable personalized multimodal generation. The key aspects of the method are:
Extracting user preferences: PMG first converts user behaviors, such as clicks in recommender systems or past conversations, into natural language to facilitate LLM understanding. It then extracts user preference descriptions using the LLM.
Representing user preferences: To capture user preferences comprehensively and accurately, PMG has the LLM output a combination of explicit keywords and implicit embeddings to represent them (see the first sketch after this list).
Conditioning the generator: The combination of keywords and embeddings is used as a prompt to condition the multimodal generator, such as a diffusion model or a multimodal LLM. PMG optimizes a weighted sum of an accuracy score (consistency with the target item) and a preference score (alignment with user preferences) to balance the generation (see the second sketch after this list).
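As a rough illustration of the first two steps, the sketch below serializes a click history into natural language, asks an LLM for preference keywords, and pools the LLM's hidden states into an implicit embedding. Everything model-related is a stub: `llm_generate`, `llm_hidden_states`, the projection matrix `proj`, and all dimensions are hypothetical placeholders, not the paper's actual interfaces.

```python
import numpy as np

# --- Hypothetical stand-ins for a real LLM; names and shapes are assumptions ---

def llm_generate(prompt: str) -> str:
    """Stub for an instruction-tuned LLM call. A real system would query
    a hosted or local model here; this stub returns fixed keywords."""
    return "dark fantasy, muted colors, gothic typography"

def llm_hidden_states(text: str, dim: int = 16) -> np.ndarray:
    """Stub for the LLM's last-layer hidden states over `text`.
    Shape (num_tokens, dim); real values would come from the model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal((len(text.split()), dim))

def behaviors_to_text(behaviors: list[dict]) -> str:
    """Serialize raw user behaviors (e.g. clicked items) into natural
    language so the LLM can reason over them."""
    lines = [f"- {b['action']}: {b['item']}" for b in behaviors]
    return "The user recently:\n" + "\n".join(lines)

def extract_preferences(behaviors: list[dict], proj: np.ndarray):
    """Return (explicit keywords, implicit embedding) for one user.

    `proj` maps LLM hidden states into the generator's prompt-embedding
    space; here it is a random placeholder standing in for whatever
    mapping the real system uses."""
    history = behaviors_to_text(behaviors)
    prompt = (history + "\n\nDescribe this user's visual style preferences "
              "as a short list of keywords.")
    keywords = llm_generate(prompt)           # explicit keywords
    hidden = llm_hidden_states(keywords)      # (num_tokens, llm_dim)
    implicit = hidden.mean(axis=0) @ proj     # pooled, then projected
    return keywords, implicit

# Toy usage with an invented click history
behaviors = [{"action": "clicked", "item": "poster for 'The Crow'"},
             {"action": "watched", "item": "'Sleepy Hollow'"}]
proj = np.random.default_rng(0).standard_normal((16, 8))  # llm_dim -> gen_dim
keywords, embedding = extract_preferences(behaviors, proj)
print(keywords, embedding.shape)
```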
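The balancing in the third step can be pictured with the toy scoring rule below, assuming the generated output, the target item, and the user preference all live in a shared CLIP-style embedding space. The weight `lam` and the candidate-ranking usage are illustrative assumptions; per the summary above, PMG optimizes this weighted score to balance the generation rather than merely ranking finished candidates.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_score(gen_emb: np.ndarray, target_emb: np.ndarray,
                   pref_emb: np.ndarray, lam: float = 0.5) -> float:
    """Weighted sum of an accuracy score (closeness to the target item)
    and a preference score (closeness to the user's preference
    representation). `lam` and its default are assumptions."""
    accuracy = cosine(gen_emb, target_emb)
    preference = cosine(gen_emb, pref_emb)
    return lam * accuracy + (1.0 - lam) * preference

# Toy usage: score candidate generations in a shared embedding space;
# in practice these vectors would come from an image/text encoder.
rng = np.random.default_rng(1)
target, pref = rng.standard_normal(8), rng.standard_normal(8)
candidates = [rng.standard_normal(8) for _ in range(4)]
best = max(candidates, key=lambda g: combined_score(g, target, pref, lam=0.6))
print(combined_score(best, target, pref, lam=0.6))
```

Raising `lam` pushes the output toward faithfulness to the target item; lowering it pushes toward the user's tastes, which is the trade-off the weighted sum is meant to expose.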
The experiments demonstrate that PMG can generate personalized images, movie posters, and emoticons that effectively combine user preferences and target item characteristics. Compared to a baseline without personalization, PMG achieves significant improvements in personalization while retaining generation accuracy.