FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks

Core Concepts
Integrating subject-level guidance into CLIP improves zero-shot transfer on human-centric tasks.
FocusCLIP integrates subject-level guidance into the CLIP framework, with enhancements on both the vision and text sides that improve performance. Trained on the MPII Human Pose dataset, it surpasses CLIP across a variety of tasks. The authors release the MPII Pose Descriptions dataset to encourage further research, and show that subject-level supervision benefits non-human-centric tasks as well.
FocusCLIP achieved an average accuracy of 33.65%, compared to 25.04% for CLIP, surpassing it by an average of 8.61% across five unseen datasets. Improvements were observed in activity recognition, age classification, and emotion recognition.
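The zero-shot transfer mechanism that FocusCLIP inherits from CLIP can be sketched as comparing an image embedding against text embeddings of class prompts and taking a softmax over their cosine similarities. The sketch below is illustrative only: the random vectors stand in for real encoder outputs, and the function name and temperature value are assumptions, not details from the paper.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Return class probabilities from cosine similarities between one
    image embedding and a set of class-prompt text embeddings
    (CLIP-style zero-shot classification sketch)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * (txt @ img)          # assumed temperature scaling
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings standing in for real vision/text encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(3, 512))
text_embs[1] += image_emb                 # make class 1 the best match
probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())                     # prints 1
```

Because classification reduces to embedding similarity, new label sets (e.g., unseen activities or emotions) only require writing new text prompts, which is what enables evaluation on datasets never seen during training.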
"Our novel contributions enhance CLIP on both the vision and text sides."
"Using our proposed single-shot LLM prompting strategy, we release a high-quality MPII Pose Descriptions dataset."

Key Insights Distilled From

by Muhammad Sai... at 03-12-2024

Deeper Inquiries

How can integrating subject-level guidance benefit other machine learning tasks?
How can the findings of this study be applied to real-world applications beyond research?
What potential ethical considerations arise from using large language models like GPT-4?