Core Concepts
The proposed Multi-MaP method leverages multi-modal models and large language models to capture a user's specific interest and discover personalized clustering structures hidden in visual data.
Abstract
The paper proposes Multi-MaP, a novel multi-modal proxy learning method that addresses a key challenge in multiple clustering: algorithms can produce many alternative clusterings of the same data, yet users typically need only the one that matches their interest.
Key highlights:
- Multi-MaP integrates a user's high-level concept (e.g., color, species) via textual prompts that trigger the corresponding feature extraction from pre-trained CLIP encoders (see the feature-extraction sketch after this list).
- It introduces a reference word constraint and a concept-level constraint to learn the optimal text proxy for the user's interest, overcoming the difficulty of searching for a word proxy in a continuous embedding space (see the proxy-learning sketch below).
- Multi-MaP leverages GPT-4 to generate candidate reference words for the user's high-level concept, which further constrain the proxy learning (see the prompt sketch below).
- Experiments on multiple public visual clustering datasets show that Multi-MaP consistently outperforms state-of-the-art methods in capturing the user's interest and generating personalized clustering results.
- The paper also demonstrates that CLIP can uncover different semantic aspects of the same images, which the authors highlight as a novel finding.
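The concept-triggered feature extraction can be sketched with the Hugging Face CLIP implementation. This is a minimal sketch: the prompt template, checkpoint, and function name are illustrative assumptions, not the paper's exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def concept_features(image_path: str, concept: str):
    """Embed an image and a concept prompt into CLIP's shared space."""
    image = Image.open(image_path)
    prompt = f"a photo of a fruit with a specific {concept}"  # hypothetical template
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # L2-normalize so cosine similarity reduces to a dot product
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return image_emb[0], text_emb[0]
```

Because both encoders map into one shared space, the concept prompt acts as a query that steers which aspect of the image embedding matters for clustering.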
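The proxy-learning step under the two constraints can be illustrated with a simplified gradient-based objective. The initialization, loss weights, and the max over reference words are assumptions for illustration, not the paper's equations; `ref_embs` would come from encoding the GPT-4 candidate words (next sketch) with the CLIP text encoder.

```python
import torch
import torch.nn.functional as F

def learn_proxy(img_emb, concept_emb, ref_embs,
                steps=200, lr=1e-2, w_concept=0.5, w_ref=0.5):
    """img_emb: (d,) image feature; concept_emb: (d,) encoded concept prompt;
    ref_embs: (k, d) encoded candidate reference words. All L2-normalized."""
    proxy = concept_emb.clone().requires_grad_(True)  # start from the concept
    opt = torch.optim.Adam([proxy], lr=lr)
    for _ in range(steps):
        p = F.normalize(proxy, dim=-1)
        sim_img = (p * img_emb).sum()          # align the proxy with the image
        sim_concept = (p * concept_emb).sum()  # concept-level constraint
        sim_ref = (p @ ref_embs.T).max()       # nearest reference-word constraint
        loss = -(sim_img + w_concept * sim_concept + w_ref * sim_ref)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(proxy.detach(), dim=-1)
```

Once a proxy is learned per image, an off-the-shelf clustering algorithm such as k-means over the proxy embeddings would yield the personalized partition.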
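Generating candidate reference words amounts to a single LLM call. Below is a hedged sketch using the `openai` Python client; the prompt wording and output parsing are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def candidate_reference_words(concept: str, k: int = 10) -> list[str]:
    """Ask GPT-4 for k plausible values of a high-level concept."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"List {k} common {concept}s of fruit as a "
                        "comma-separated list, nothing else."),
        }],
    )
    return [w.strip() for w in resp.choices[0].message.content.split(",")]

# e.g., candidate_reference_words("color") -> ["red", "yellow", "green", ...]
```

For the concept "color", such a query produces the kind of list quoted under Stats below.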
Stats
Common colors of fruit include red, yellow, green, orange, purple, and blue.
The Fruit360 dataset contains 4,856 samples with 4 color clusters and 4 species clusters.
The Flowers dataset contains 1,600 samples with 4 color clusters and 4 species clusters.
Quotes
"Given only a high-level concept from the user, it is infeasible to fine-tune the pre-trained models to capture a specific aspect of the data, without the detailed labels corresponding to the user's concept."
"Fortunately, given CLIP's ability to model image-text pairs collaboratively, we can use a user's high-level concept to trigger the corresponding feature extraction from the pre-trained encoders from CLIP."