Key Concepts
Enabling VLMs to understand and reason over user-specific concepts enhances human-computer interaction.
Summary
The paper introduces MyVLM, a method for personalizing vision-language models (VLMs), focusing on personalized image captioning and visual question-answering. The approach first recognizes user-specific concepts with external concept heads, then trains a concept embedding within the VLM to represent each concept. Results show improved performance in generating personalized captions and answering questions about specific concepts.
Concepts:
- Introduction of MyVLM for vision-language personalization.
- Methodology involving concept recognition and embedding training.
- Results demonstrating enhanced performance in personalized tasks.
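The concept-recognition and embedding-training steps above can be sketched as follows. This is a minimal illustration only: the function names, dimensions, and linear probe are assumptions for the sketch, not MyVLM's actual implementation.

```python
import numpy as np


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))


def concept_head(image_features: np.ndarray, w: np.ndarray, b: float) -> float:
    """Lightweight probe over frozen image features that outputs the
    probability that the user-specific concept appears in the image.
    (A linear probe stands in here for whatever classifier is used.)"""
    return float(sigmoid(image_features @ w + b))


def inject_concept(prompt_embeds: np.ndarray, concept_vec: np.ndarray) -> np.ndarray:
    """Append the learned concept embedding as one extra soft token to the
    VLM's input embeddings; the VLM weights themselves stay frozen."""
    return np.concatenate([prompt_embeds, concept_vec[None, :]], axis=0)


rng = np.random.default_rng(0)
feat_dim, embed_dim = 16, 8

features = rng.normal(size=feat_dim)       # toy stand-in for encoder output
w, b = rng.normal(size=feat_dim), 0.0      # concept-head parameters
p = concept_head(features, w, b)           # detection probability in (0, 1)

prompt = rng.normal(size=(5, embed_dim))   # 5 soft-prompt tokens (toy)
concept_vec = rng.normal(size=embed_dim)   # would be trained via a captioning loss
# In practice the embedding is injected only when p exceeds a threshold.
augmented = inject_concept(prompt, concept_vec)
print(p, augmented.shape)
```

The key design point this illustrates is that only the small concept head and the single concept embedding are learned per concept, so the base VLM remains unchanged for unrelated inputs.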
Experiments:
- Dataset creation for evaluating VLM personalization.
- Evaluation metrics include recall, image similarity, and text similarity.
- Comparison with baselines and ablation study on training samples.
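Two of the metrics listed above can be sketched in a few lines. Exact-substring recall and cosine similarity are common instantiations used here for illustration; the paper's precise formulations (e.g. which embedding model backs the similarity scores) are assumptions, not reproduced from the source.

```python
import math


def concept_recall(captions: list[str], identifier: str) -> float:
    """Fraction of generated captions that mention the concept identifier."""
    return sum(identifier in c for c in captions) / len(captions)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, usable for image-image or text-text comparison
    once both sides are embedded (e.g. with a CLIP-style encoder)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


captions = ["my-mug on a wooden desk", "a cat on a sofa"]  # "my-mug" is a made-up identifier
print(concept_recall(captions, "my-mug"))           # 0.5
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))    # ~0.707
```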
Applications:
- Personalized Visual Question-Answering results demonstrate accurate responses to questions about specific concepts.
- Personalized Referring Expression Comprehension showcases the ability to localize target subjects in images without direct supervision.
Statistics
"This research was performed while Yuval Alaluf was at Snap."
"Large language models (LLMs) have transformed human-computer interaction."
"We apply our technique to BLIP-2 and LLaVA for personalized image captioning."
Quotes
"A white t-shirt with the words “LOS ANGELES” printed on it"
"On the left side of the image, are sitting at a table with a drink"
"Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs."