Core Concept
CLIP's accuracy is significantly lower on images taken by blind/low vision users than on web-crawled images, owing to sensitivities to image content, image quality, and textual content.
Summary
The study evaluates CLIP's performance disparities on data from blind/low vision (BLV) users. It systematically assesses 25 CLIP variants in a zero-shot classification task, revealing a 15 percentage point accuracy gap between BLV and web-crawled images. The disparities stem from issues with image content recognition, robustness to image quality, and textual content understanding. Three common pre-training datasets are analyzed for how well they represent disability content. The performance gaps extend to downstream models such as OWL-ViT, CLIPSeg, and DALL-E 2. Mitigation strategies include few-shot learning and application-level solutions.
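For context, a minimal sketch of the kind of CLIP zero-shot classification evaluated in the study is shown below. The model checkpoint, prompt template, label set, and image path are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: CLIP zero-shot classification (illustrative assumptions only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Candidate labels, including a disability object (hypothetical label set).
labels = ["a photo of a white cane", "a photo of an umbrella", "a photo of a broom"]
image = Image.open("blv_photo.jpg")  # hypothetical path to a BLV-captured image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1)
print(labels[probs.argmax().item()], probs.max().item())
```

Zero-shot accuracy is then the fraction of images whose highest-scoring label matches the ground truth, computed separately for BLV and web-crawled images to measure the gap.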
Structure:
Introduction: Potential of AI for BLV assistance.
Data Disparities: Lower accuracy of CLIP on BLV data.
Image Content Sensitivity: Recognition challenges for disability objects.
Image Quality Impact: Effects of atypical framing, blur, viewpoint issues.
Textual Content Analysis: Recognition differences based on color vs material descriptions.
Downstream Model Impact: Performance disparities propagate to OWL-ViT, CLIPSeg, and DALL-E 2.
Mitigation Strategies: Few-shot learning and application-level solutions (see the few-shot sketch after this list).
Conclusion: Call for transparency in dataset reporting and equitable LMM development.
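The study reports that as few as 5 labeled images can mitigate the disparities. Below is a minimal sketch of one common few-shot approach over frozen CLIP image embeddings (a nearest-class-mean probe); the paper's exact few-shot method may differ, and the class names and file paths are hypothetical.

```python
# Minimal sketch: few-shot adaptation of frozen CLIP embeddings via a
# nearest-class-mean (prototype) classifier. Illustrative only; the paper's
# exact few-shot method may differ. Paths and class names are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def embed(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# ~5 labeled BLV-captured support images per class (hypothetical paths).
support = {
    "white cane": ["cane_1.jpg", "cane_2.jpg", "cane_3.jpg", "cane_4.jpg", "cane_5.jpg"],
    "guide dog harness": ["harness_1.jpg", "harness_2.jpg", "harness_3.jpg",
                          "harness_4.jpg", "harness_5.jpg"],
}
classes = list(support)

# One prototype per class: the mean of its support embeddings.
prototypes = torch.stack([embed(paths).mean(dim=0) for paths in support.values()])
prototypes = prototypes / prototypes.norm(dim=-1, keepdim=True)

# Classify a new BLV image by cosine similarity to the class prototypes.
query = embed(["query.jpg"])  # hypothetical query image
pred = classes[(query @ prototypes.T).argmax().item()]
print(pred)
```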
Statistics
Across the 25 CLIP variants tested, accuracy on BLV user images is 15 percentage points lower than on web-crawled images.
Disability objects are recognized with 25% lower accuracy than non-disability objects across CLIP variants.
Captions in large-scale pre-training datasets mention disability objects 16-17x less frequently than non-disability objects.
Quotes
"We find that few-shot learning with as few as 5 images can mitigate CLIP’s quality-of-service disparities for BLV users."
"Disability objects are recognized less accurately by CLIP compared to non-disability objects."