Core Concepts
Integrating subject-level guidance enhances CLIP for improved zero-shot transfer on human-centric tasks.
Abstract
FocusCLIP integrates subject-level guidance into the CLIP framework.
Enhancements on both vision and text sides improve performance.
FocusCLIP is trained on the MPII Human Pose dataset and surpasses CLIP across various human-centric tasks.
The MPII Pose Descriptions dataset is released to encourage further research.
Subject-level supervision benefits non-human-centric tasks as well.
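To make the idea of subject-level guidance concrete, here is a minimal numpy sketch of a CLIP-style symmetric contrastive loss extended with a second alignment term for subject-focused (ROI) image embeddings. This is an illustration of the general idea only, not FocusCLIP's actual objective: the function names, the `roi_emb` input, and the `alpha` weighting are assumptions for this sketch, and the paper's real architecture and loss may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere, as in CLIP.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(a, b, temperature=0.07):
    # Symmetric InfoNCE loss between two batches of embeddings;
    # matched pairs sit on the diagonal of the similarity matrix.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    n = logits.shape[0]
    log_p_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_ba = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -(np.trace(log_p_ab) + np.trace(log_p_ba)) / (2 * n)

def subject_guided_loss(img_emb, txt_emb, roi_emb, alpha=0.5):
    # Hypothetical combination: the standard image-text term plus a
    # subject-level term aligning ROI crops with the same descriptions.
    return contrastive_loss(img_emb, txt_emb) + alpha * contrastive_loss(roi_emb, txt_emb)
```

The extra term rewards the model for matching text descriptions not only to the whole image but also to the human subject region, which is the intuition behind the reported gains on human-centric tasks.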
Stats
FocusCLIP achieved an average accuracy of 33.65% compared to 25.04% by CLIP.
The proposed approach surpassed CLIP by an average of 8.61% across five unseen datasets.
Improvements observed in activity recognition, age classification, and emotion recognition.
Quotes
"Our novel contributions enhance CLIP on both the vision and text sides."
"Using our proposed single-shot LLM prompting strategy, we release a high-quality MPII Pose Descriptions dataset."