Identifying Unreliable Responses from Black-Box Vision-Language Models Using Neighborhood Consistency
The consistency of a black-box vision-language model's answers over rephrasings of a visual question can be used to identify its unreliable predictions, even when the model that generates the rephrasings is substantially smaller than the black-box model.
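
As a rough illustration of the idea, the sketch below scores a prediction by the fraction of rephrased questions for which the black-box model returns the same answer as it did for the original question. Everything here is an assumption made for illustration rather than the method as specified in the source: `answer_fn` is a hypothetical stand-in for a query to the black-box model, the rephrasings are assumed to be supplied by a separate (possibly much smaller) model, agreement is measured by exact match after simple normalization, and the threshold of 0.5 is an arbitrary placeholder.

```python
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def neighborhood_consistency(
    answer_fn: Callable[[object, str], str],  # hypothetical black-box query: (image, question) -> answer
    image: object,
    question: str,
    rephrasings: Sequence[str],  # assumed to come from a separate rephrasing model
) -> float:
    """Fraction of rephrasings whose answer agrees with the answer to the original question."""
    if not rephrasings:
        raise ValueError("need at least one rephrasing")
    original = normalize(answer_fn(image, question))
    agree = sum(normalize(answer_fn(image, r)) == original for r in rephrasings)
    return agree / len(rephrasings)


def is_unreliable(score: float, threshold: float = 0.5) -> bool:
    """Flag a prediction as unreliable when consistency falls below the threshold (assumed value)."""
    return score < threshold
```

Only the black-box model's outputs are needed, which is what makes the score usable without access to its internal probabilities or features. In practice, exact match could be swapped for a softer agreement measure, and the threshold would be chosen on held-out data.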