The paper introduces MLLMReID, a novel approach for person re-identification (ReID) using multimodal large language models (MLLMs). It addresses two key challenges in adapting MLLMs to ReID tasks:
Designing instructions for ReID: The paper proposes Common Instruction, a simple continuation instruction under which both text and image inputs are trained to produce the same continuation text. This avoids complex and diverse instruction designs, which can cause the model to overfit to the instructions themselves. A sketch of the idea follows.
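To make this concrete, here is a minimal sketch of how one shared continuation instruction might be paired with both modalities. The prompt wording, the `<image>` placeholder, and the helper name are illustrative assumptions for this summary, not the paper's exact implementation.

```python
# Hypothetical sketch of the Common Instruction idea: one simple
# continuation instruction is shared by text and image samples, so both
# modalities are trained toward the same continuation target.

COMMON_INSTRUCTION = "Continue writing this description:"  # assumed wording

def build_prompt(person_caption: str, with_image: bool) -> str:
    """Pair the shared instruction with either a caption or image tokens.

    For image samples, "<image>" marks where the visual encoder's tokens
    are spliced in. The instruction and the continuation target stay
    identical across modalities, so no zoo of task-specific instructions
    is needed and overfitting to instruction phrasing is avoided.
    """
    content = "<image>" if with_image else person_caption
    return f"{COMMON_INSTRUCTION}\n{content}"

# Both calls yield prompts with the same instruction and continuation target.
text_prompt = build_prompt("A man in a red jacket and black jeans.", with_image=False)
image_prompt = build_prompt("", with_image=True)
```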
Utilizing latent image features from LLMs: The paper introduces DirectReID, which applies the latent image feature vectors output by the LLM directly to the ReID task, optimizing the visual encoder through the ReID-specific loss functions (ID loss and triplet loss). This strengthens the visual encoder's ability to extract distinctive person features; a sketch follows.
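As a rough illustration, the following is a minimal PyTorch sketch of the DirectReID idea, assuming the LLM's last hidden states and the positions of the image tokens are available. The module and argument names (DirectReIDHead, image_token_mask, etc.) are made up for this example, and batch-hard triplet mining is a common ReID choice rather than a detail confirmed by the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Standard batch-hard triplet loss over pooled features."""
    dist = torch.cdist(feats, feats)                      # (B, B) pairwise L2 distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) identity mask
    hardest_pos = (dist * same_id.float()).max(dim=1).values
    hardest_neg = (dist + same_id.float() * 1e9).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

class DirectReIDHead(nn.Module):
    """Applies ReID losses directly to the LLM's latent image features."""
    def __init__(self, feat_dim: int, num_ids: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_ids)  # for the ID (cross-entropy) loss

    def forward(self, hidden_states, image_token_mask, labels):
        # Mean-pool the latent image-token features output by the LLM.
        mask = image_token_mask.unsqueeze(-1).float()        # (B, T, 1)
        feats = (hidden_states * mask).sum(1) / mask.sum(1)  # (B, D)

        id_loss = F.cross_entropy(self.classifier(feats), labels)
        tri_loss = batch_hard_triplet_loss(feats, labels)
        return id_loss + tri_loss

# Toy usage: 4 samples, 2 identities, 8 tokens each, 16-dim hidden states.
hidden = torch.randn(4, 8, 16, requires_grad=True)
img_mask = torch.zeros(4, 8, dtype=torch.bool)
img_mask[:, :4] = True                      # first 4 tokens are image tokens
ids = torch.tensor([0, 0, 1, 1])
loss = DirectReIDHead(feat_dim=16, num_ids=2)(hidden, img_mask, ids)
loss.backward()                             # gradients flow back toward the encoder
```

Because the losses are computed on features that have already passed through the LLM, the gradient path runs from the ReID objectives through the image tokens back to the visual encoder, which is what lets these losses shape the encoder's person-specific features.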
The experimental results show that MLLMReID outperforms other state-of-the-art methods, particularly on large-scale datasets such as MSMT17. The authors attribute this improvement to the Common Instruction preserving the LLM's diversity and to the DirectReID module fully leveraging the LLM's latent image features for person feature learning.
The paper also includes a case study that illustrates how the combination of Common Instruction and DirectReID enables the model to better differentiate between persons with similar appearances, leading to improved retrieval accuracy.
Source: Shan Yang et al., https://arxiv.org/pdf/2401.13201.pdf (arXiv, 04-04-2024)