Core Concepts
A novel video-based pedestrian attribute recognition framework that efficiently fine-tunes a pre-trained multi-modal foundation model via a spatiotemporal side-tuning strategy, capturing global visual features while aligning vision and language information.
Abstract
The proposed framework, termed VTFPAR++, formulates video-based pedestrian attribute recognition as a vision-language fusion problem. It adopts the pre-trained CLIP model as the backbone to extract visual and text features. To efficiently fine-tune the large pre-trained model, a novel spatiotemporal side tuning strategy is introduced.
Specifically, the framework first encodes the input pedestrian video frames using the CLIP vision encoder. Lightweight spatial and temporal side networks are introduced to aggregate multi-scale visual features from different Transformer layers and to model temporal relationships across frames, respectively. These spatiotemporal features are then fused with the text features of the attribute descriptions using a multi-modal Transformer, and the enhanced features are fed into an attribute prediction head for recognition.
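The pipeline above can be sketched end to end. This is a minimal numpy illustration, not the paper's implementation: the dimensions, the weighted layer aggregation, the attention-based temporal pooling, and the cosine-similarity fusion are all simplifying assumptions standing in for the learned side networks and multi-modal Transformer.

```python
# Hedged sketch of the spatiotemporal side-tuning pipeline (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

T, L, D = 6, 4, 32          # frames, tapped CLIP layers, feature dim (assumed)
num_attrs = 5               # number of attribute descriptions

# Frozen CLIP vision encoder: per-frame features from several Transformer
# layers (random stand-ins here, shape [T, L, D]).
clip_layer_feats = rng.standard_normal((T, L, D))

# Spatial side network: lightweight learned weights aggregate the multi-scale
# (per-layer) features of each frame into one vector per frame.
W_spatial = rng.standard_normal((L,))
spatial_feats = np.einsum("tld,l->td", clip_layer_feats, W_spatial) / L  # [T, D]

# Temporal side network: softmax attention over frames models cross-frame
# relationships and pools the sequence into one video-level feature.
scores = spatial_feats @ spatial_feats.mean(axis=0)        # [T]
attn = np.exp(scores - scores.max())
attn /= attn.sum()
video_feat = attn @ spatial_feats                          # [D]

# Frozen CLIP text encoder output for each attribute description (stand-in).
text_feats = rng.standard_normal((num_attrs, D))

# Fusion + prediction head: cosine similarity between the video feature and
# each attribute's text feature, squashed to per-attribute probabilities.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

logits = np.array([cosine(video_feat, t) for t in text_feats])
probs = 1.0 / (1.0 + np.exp(-logits))                      # multi-label output
print(probs.shape)
```

In the actual framework the aggregation and temporal modules are small trainable networks and the fusion is a multi-modal Transformer; the sketch only mirrors the data flow (multi-layer visual features → per-frame features → video feature → attribute scores).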
The spatiotemporal side tuning strategy allows the framework to adapt the pre-trained CLIP model to video-based pedestrian attribute recognition efficiently, with only a small fraction of the parameters being fine-tuned. Extensive experiments on two large-scale video-based pedestrian attribute recognition datasets demonstrate that the proposed VTFPAR++ outperforms state-of-the-art methods in accuracy, while requiring less GPU memory and far fewer tunable parameters.
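The efficiency claim rests on freezing the large backbone and training only the small side networks. The back-of-the-envelope arithmetic below uses assumed, round figures (backbone size, feature dimension, module shapes), not measurements from the paper, to show why the trainable share ends up small.

```python
# Illustrative parameter-efficiency arithmetic; all sizes are assumptions.
CLIP_BACKBONE_PARAMS = 150_000_000   # frozen vision + text encoders (rough)

D, L = 512, 12                       # assumed feature dim and tapped layers

# Lightweight spatial side network: one small projection (weights + bias)
# per tapped Transformer layer.
spatial_side = L * (D * D + D)

# Lightweight temporal side network: a single self-attention block over
# frames (Q, K, V, and output projections).
temporal_side = 4 * (D * D + D)

trainable = spatial_side + temporal_side
total = CLIP_BACKBONE_PARAMS + trainable

print(f"trainable parameters: {trainable:,}")
print(f"share of total: {trainable / total:.2%}")
```

Under these assumptions only a few million parameters, a low single-digit percentage of the model, receive gradients, which is what keeps GPU memory and optimizer state small during fine-tuning.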
Stats
Video frames provide more comprehensive visual information for pedestrian attribute recognition than a single RGB frame.
Existing video-based pedestrian attribute recognition methods often fail to capture global relations in pixel space and to align vision and language information well.
Quotes
"The video frames can provide more comprehensive visual information for the specific attribute, but the static image fails to."
"How to design a novel video-based pedestrian attribute recognition framework that simultaneously captures the global features of vision data, and aligns the vision and semantic attribute labels well?"