The authors introduce the first fair vision-language medical dataset (FairVLMed) that provides detailed demographic attributes, ground-truth labels, and clinical notes to facilitate an in-depth examination of fairness within vision-language (VL) foundation models. Using FairVLMed, they conduct a comprehensive fairness analysis of two widely-used VL models (CLIP and BLIP2), pre-trained on both natural and medical domains, across four different protected attributes (race, gender, ethnicity, and language).
The results highlight significant biases in all VL models, with Asian, Male, Non-Hispanic, and Spanish being the preferred subgroups for race, gender, ethnicity, and language, respectively. Medical pre-training improves the performance-fairness trade-off on every attribute except language, and the two VL pre-training methods show complementary strengths: CLIP performs better on race and gender, whereas BLIP2 performs better on ethnicity and language.
To address these fairness issues, the authors propose FairCLIP, an optimal transport-based approach that achieves a favorable trade-off between performance and fairness by reducing the Sinkhorn distance between the overall sample distribution and the distributions corresponding to each demographic group. Extensive analyses demonstrate the effectiveness of FairCLIP in improving fairness across various protected attributes compared to the standard CLIP model.
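To make the mechanism concrete, below is a minimal PyTorch sketch of this idea: a standard CLIP contrastive loss augmented with an entropy-regularized optimal-transport (Sinkhorn) penalty between the batch-wide distribution of paired image-text similarity scores and the distribution within each demographic subgroup. All names and hyperparameters here (`fairclip_loss`, `sinkhorn_distance`, `lambda_fair`, `epsilon`, the feature dimensions) are illustrative assumptions for exposition, not the authors' released implementation.

```python
import math

import torch
import torch.nn.functional as F


def sinkhorn_distance(x, y, epsilon=0.1, n_iters=100):
    """Entropy-regularized OT cost between two 1-D empirical score
    distributions with uniform weights, via log-domain Sinkhorn iterations."""
    x, y = x.view(-1), y.view(-1)
    n, m = x.numel(), y.numel()
    cost = (x.view(-1, 1) - y.view(1, -1)) ** 2   # squared-difference cost matrix
    log_mu, log_nu = -math.log(n), -math.log(m)   # log of uniform marginal weights
    f = torch.zeros(n, device=x.device)
    g = torch.zeros(m, device=y.device)
    for _ in range(n_iters):
        f = -epsilon * torch.logsumexp((g.view(1, -1) - cost) / epsilon + log_nu, dim=1)
        g = -epsilon * torch.logsumexp((f.view(-1, 1) - cost) / epsilon + log_mu, dim=0)
    # Entropic transport plan and the resulting transport cost.
    plan = torch.exp((f.view(-1, 1) + g.view(1, -1) - cost) / epsilon + log_mu + log_nu)
    return (plan * cost).sum()


def fairclip_loss(image_feats, text_feats, group_ids, temperature=0.07, lambda_fair=1e-4):
    """CLIP contrastive loss plus a Sinkhorn penalty pulling each demographic
    subgroup's image-text similarity distribution toward the batch-wide one."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    clip_loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Paired image-text cosine similarities define the overall sample distribution.
    scores = (image_feats * text_feats).sum(dim=-1)
    fair_penalty = scores.new_zeros(())
    for g in group_ids.unique():
        mask = group_ids == g
        if mask.sum() > 1:   # need at least two samples for a meaningful distribution
            fair_penalty = fair_penalty + sinkhorn_distance(scores, scores[mask])
    return clip_loss + lambda_fair * fair_penalty


# Toy usage with random features and a binary protected attribute.
imgs = torch.randn(32, 512, requires_grad=True)
txts = torch.randn(32, 512, requires_grad=True)
groups = torch.randint(0, 2, (32,))
loss = fairclip_loss(imgs, txts, groups)
loss.backward()
```

In practice the fairness weight and the entropic regularization strength would be tuned per protected attribute and per batch size; the sketch simply treats them as fixed hyperparameters.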