Feature Completion Transformer for Occluded Person Re-identification

Core Concepts
FCFormer enhances occluded person re-identification by completing missing features in occluded regions, outperforming existing methods.
The content introduces FCFormer, a Feature Completion Transformer for occluded person re-identification. It addresses the challenges of occluded person re-identification by completing the missing features of occluded regions. The method comprises an Occlusion Instance Augmentation strategy, a Dual Stream Architecture, and a Feature Completion Decoder. Extensive experiments on challenging datasets demonstrate superior performance compared to state-of-the-art methods.

Introduction
Occluded person re-identification is challenging because obstacles in camera views hide parts of the body. Most existing paradigms focus on visible body parts but neglect the feature misalignment caused by discarding occluded regions. The Feature Completion Transformer (FCFormer) reduces noise interference and complements the missing features of occluded parts. It uses Occlusion Instance Augmentation to simulate diverse occlusion situations and to construct aligned occluded-holistic pairs.

Dual Stream Architecture
A shared ViT backbone extracts general features, while unshared transformer layers learn patterns specific to the occluded and holistic tasks. Global-local token mixing enhances the model's understanding of occlusion relationships.

Feature Completion Stream
The Feature Completion Decoder recovers the features of occluded parts using learnable completion tokens and transformer layers. An MSE loss between completed and holistic features drives the decoder's training.

Overall Training Loss
A Cross Hard Triplet Loss improves metric learning among the different feature modes, and a Feature Completion Consistency Loss keeps the completed and holistic feature distributions consistent.

Experiments
Evaluation on five datasets shows FCFormer's competitive performance on both occluded and holistic person re-identification tasks.

Performance under Transfer Setting
FCFormer remains competitive when trained on Market-1501 or MSMT17 and evaluated on occluded datasets.
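The feature-completion step can be pictured as a cross-attention pass in which learnable completion tokens query the occluded image's token sequence, and the result is supervised with an MSE loss against the holistic branch. A minimal NumPy sketch, where all dimensions, weights, and the "holistic" targets are illustrative stand-ins rather than the paper's actual values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def complete_features(occluded_tokens, completion_tokens, W_q, W_k, W_v):
    """One cross-attention step: learnable completion tokens act as queries
    and gather information from the occluded image's tokens (keys/values)."""
    Q = completion_tokens @ W_q            # (n_comp, d)
    K = occluded_tokens @ W_k              # (n_tok, d)
    V = occluded_tokens @ W_v
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V                        # (n_comp, d) completed features

rng = np.random.default_rng(0)
d, n_tok, n_comp = 8, 16, 4                # toy sizes, not the paper's
occ = rng.standard_normal((n_tok, d))      # tokens from the occluded branch
comp = rng.standard_normal((n_comp, d))    # learnable completion tokens
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
completed = complete_features(occ, comp, *W)

# MSE supervision against features from the holistic branch
# (random stand-ins here; in training these come from the holistic image)
holistic = rng.standard_normal((n_comp, d))
mse = np.mean((completed - holistic) ** 2)
```

In the actual model the decoder stacks full transformer layers; this single-head, single-layer version only shows how completion tokens can pull information out of the visible tokens.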
Recovery from Completion Features
An image decoder successfully recovers holistic pedestrian images from the completion features, showcasing the effectiveness of the Feature Completion Decoder.
"Extensive experiments over five challenging datasets demonstrate that the proposed FCFormer achieves superior performance."
"The proposed method could adaptively complement the features in occluded regions."
"To validate the effectiveness of our method, we perform experiments on occluded and holistic Re-ID datasets."
"Most existing paradigms focus on visible human body parts through some external models to reduce noise interference."
"Different from most previous works that discard the occluded regions, we present Feature Completion Transformer (FCFormer)."

Deeper Inquiries

How does FCFormer handle diverse and realistic occlusion scenarios?

FCFormer handles diverse and realistic occlusion scenarios by introducing an Occlusion Instance Library (OIL) and an Occlusion Instance Augmentation strategy (OIA). The OIL contains a variety of common occlusions obtained from different datasets, ensuring the simulation of real-world occlusion scenarios. The OIA strategy randomly selects occlusion samples from the OIL, scales them according to specific ratios, and pastes them onto training images based on strong or weak position priors. This approach allows for the generation of diverse and realistic occluded training image pairs, enhancing the robustness of the model to various types of occlusions.
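The scale-and-paste step of OIA can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the occluder patch, target scale, and paste position are placeholders for samples drawn from the Occlusion Instance Library and its position priors.

```python
import numpy as np

def occlusion_augment(image, occluder, scale, top_left):
    """Paste a scaled occluder patch onto a training image (OIA-style sketch).
    image, occluder: HxWxC uint8 arrays; scale: target (h, w);
    top_left: paste position, chosen from a position prior in practice."""
    h, w = scale
    # nearest-neighbour resize of the occluder to the target size
    ys = np.linspace(0, occluder.shape[0] - 1, h).astype(int)
    xs = np.linspace(0, occluder.shape[1] - 1, w).astype(int)
    patch = occluder[ys][:, xs]
    out = image.copy()                     # keep the holistic image intact
    y, x = top_left
    out[y:y + h, x:x + w] = patch
    return out

# Toy example: a blank "pedestrian" image and a uniform grey occluder
holistic = np.zeros((64, 32, 3), dtype=np.uint8)
occluder = np.full((16, 16, 3), 200, dtype=np.uint8)
occluded = occlusion_augment(holistic, occluder, (12, 8), (40, 10))
pair = (occluded, holistic)                # aligned occluded-holistic pair
```

Because the occluded image is produced from its own holistic counterpart, the pair is pixel-aligned by construction, which is what lets the feature-completion loss compare the two branches token by token.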

What are the implications of using a shared ViT backbone with unshared transformer layers?

Using a shared ViT backbone with unshared transformer layers gives FCFormer two complementary benefits. The shared ViT backbone extracts general features efficiently while keeping the holistic and occluded branches consistent with each other. The unshared transformer layers then let each branch learn the patterns specific to its own task—holistic Re-ID or feature completion—independently. This architecture enables FCFormer to learn discriminative features for both tasks without interference between them.
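The shared-then-unshared layout can be sketched structurally as follows. This is a toy stand-in, with a simple linear+ReLU layer in place of real transformer blocks and arbitrary small dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W):
    # stand-in for one transformer layer: linear projection + ReLU
    return np.maximum(x @ W, 0.0)

d = 8                                           # toy embedding size
W_shared = rng.standard_normal((d, d)) * 0.1    # shared ViT backbone weights
W_holistic = rng.standard_normal((d, d)) * 0.1  # unshared holistic-branch layer
W_occluded = rng.standard_normal((d, d)) * 0.1  # unshared occluded-branch layer

def forward(tokens):
    shared = layer(tokens, W_shared)            # general features, used by both
    return layer(shared, W_holistic), layer(shared, W_occluded)

holistic_feat, occluded_feat = forward(rng.standard_normal((16, d)))
```

The key point is that both branches read the same shared representation but apply their own weights afterwards, so gradients from each task only update that task's unshared layers plus the common backbone.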

How does FCFormer compare to other methods that rely on additional models or cues for feature recovery?

FCFormer differs from other methods that rely on additional models or cues for feature recovery by adopting a self-supervised paradigm for completing missing features in occluded regions directly from constructed occluded-holistic pairs. Unlike methods that use external models like pose estimators or skeleton information, FCFormer does not require any pre-defined human body regions or additional labels during training. This flexibility makes FCFormer more adaptable to diverse datasets and challenging environments where external cues may not always be reliable or available. Additionally, by incorporating learnable completion tokens within its Feature Completion Decoder, FCFormer can effectively recover missing features without relying on external information sources, showcasing its robustness in handling feature recovery tasks autonomously.